From jonmon at mcs.anl.gov Thu Sep 1 13:07:53 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 1 Sep 2011 13:07:53 -0500
Subject: [Swift-devel] Swift is hanging
Message-ID: <3BC569E3-D8B2-490F-9089-D3A2E1FC93B2@mcs.anl.gov>

Hello,

I tried running the Swift 0.93 candidate. Swift started to hang after 27 tasks. The log is located at http://www.ci.uchicago.edu/~jonmon/logs/montage-3.log. This run was executed using coasters. The coaster log is at http://www.ci.uchicago.edu/~jonmon/logs/coasters.log.

Progress: time: Thu, 01 Sep 2011 12:54:01 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 12:54:31 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 12:55:01 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 12:55:31 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 12:56:01 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 12:56:31 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 12:57:01 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 12:57:31 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 12:58:01 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 12:58:31 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 12:59:01 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 12:59:31 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:00:01 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:00:31 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:01:01 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:01:31 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:02:01 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:02:31 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:03:01 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:03:31 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:04:01 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:04:31 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:05:01 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:05:31 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:06:01 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:06:31 -0500 Submitted:15 Active:8 Finished successfully:27
Progress: time: Thu, 01 Sep 2011 13:07:01 -0500 Submitted:15 Active:8 Finished successfully:27

It shows that 8 tasks are active but there were no jobs active in PADS or Beagle when I checked with showq.

From turam at mcs.anl.gov Thu Sep 1 13:42:27 2011
From: turam at mcs.anl.gov (Thomas Uram)
Date: Thu, 1 Sep 2011 13:42:27 -0500
Subject: [Swift-devel] Release notes for 0.92?
Message-ID:

Maybe I've overlooked, but while I notice that earlier releases link to release notes, 0.92 does not do this:

http://www.ci.uchicago.edu/swift/downloads/

Does such exist somewhere? Specifically, I'm wondering if passive coasters support is in 0.92 or if it's yet to come.

Thanks,
Tom

From hategan at mcs.anl.gov Thu Sep 1 13:43:47 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 01 Sep 2011 11:43:47 -0700
Subject: [Swift-devel] queuedsize > 0 but no job dequeued
In-Reply-To:
References: <1314663369.31525.0.camel@blabla> <1314667837.919.0.camel@blabla> <1314676559.1750.0.camel@blabla> <1314734754.6888.0.camel@blabla>
Message-ID: <1314902627.21196.0.camel@blabla>

Is there any chance that some of your jobs run longer than their requested walltime?

On Wed, 2011-08-31 at 09:04 -0500, Ketan Maheshwari wrote:
> Mihael,
>
> I did the run with the debug enabled on coasters. Please find the logs
> etc, for this run here:
>
> http://www.ci.uchicago.edu/~ketan/run25.tgz
>
> Note that the run went well and ran up to 20k jobs without issues.
> After that I did not get nodes so I stopped it and resumed it this
> morning. It ran for about 1000+ jobs and crashed with the same error
> message.
>
> Regards,
> Ketan
>
> On Tue, Aug 30, 2011 at 3:05 PM, Mihael Hategan wrote:
> Any chance you can re-run this with debug enabled on coasters
> (log4j.logger.org.globus.cog.abstraction.coaster=DEBUG)?
>
> On Mon, 2011-08-29 at 20:55 -0700, Mihael Hategan wrote:
> > My bad. The info is in the swift log.
> >
> > On Mon, 2011-08-29 at 20:59 -0500, Ketan Maheshwari wrote:
> > > This is on Beagle. I am running local:pbs from /lustre.
> > >
> > > On Mon, Aug 29, 2011 at 8:30 PM, Mihael Hategan wrote:
> > > On Mon, 2011-08-29 at 19:52 -0500, Ketan Maheshwari wrote:
> > > > Mihael,
> > > >
> > > > This run was with automatic coasters. I do not see any specific
> > > > coasters.log file written during this run in .globus/coaster nor in
> > > > the run's work dir.
> > >
> > > It's on the remote site in .globus/coasters.
> > >
> > > > Ketan
> > > >
> > > > On Mon, Aug 29, 2011 at 7:16 PM, Mihael Hategan wrote:
> > > > Can I have the coasters log please?
> > > >
> > > > On Sun, 2011-08-28 at 16:47 -0500, Ketan Maheshwari wrote:
> > > > > Hello,
> > > > >
> > > > > I remember this error happened in the past with Glen's and Sheri's
> > > > > runs. I saw this today again on Beagle with 0.93 while running the
> > > > > DSSAT run.
> > > > >
> > > > > The run stops with the following complete message:
> > > > >
> > > > > queuedsize > 0 but no job dequeued. Queued: {}
> > > > > java.lang.Throwable
> > > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269)
> > > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539)
> > > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110)
> > > > > queuedsize > 0 but no job dequeued. Queued: {}
> > > > > java.lang.Throwable
> > > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269)
> > > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539)
> > > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110)
> > > > > Progress: time: Sun, 28 Aug 2011 13:34:26 -0600 Submitted:76 Active:23 Checking status:1 Finished successfully:597
> > > > >
> > > > > The logs, properties and sources for this run are:
> > > > > http://www.ci.uchicago.edu/~ketan/run23.tgz
> > > > >
> > > > > Regards,
> > > > > --
> > > > > Ketan
> > > > >
> > > > > _______________________________________________
> > > > > Swift-devel mailing list
> > > > > Swift-devel at ci.uchicago.edu
> > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > >
> > > > --
> > > > Ketan
> > >
> > > --
> > > Ketan
>
> --
> Ketan

From hategan at mcs.anl.gov Thu Sep 1 15:52:38 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 01 Sep 2011 13:52:38 -0700
Subject: [Swift-devel] Swift is hanging
In-Reply-To: <3BC569E3-D8B2-490F-9089-D3A2E1FC93B2@mcs.anl.gov>
References: <3BC569E3-D8B2-490F-9089-D3A2E1FC93B2@mcs.anl.gov>
Message-ID: <1314910358.23684.0.camel@blabla>

Can you try a current svn checkout of the branch instead?

On Thu, 2011-09-01 at 13:07 -0500, Jonathan Monette wrote:
> Hello,
> I tried running the Swift 0.93 candidate. Swift started to hang
> after 27 tasks. The log is located
> at http://www.ci.uchicago.edu/~jonmon/logs/montage-3.log. This run
> was executed using coasters. The coaster log is
> at http://www.ci.uchicago.edu/~jonmon/logs/coasters.log.
>
> It shows that 8 tasks are active but there were no jobs active in PADS
> or Beagle when I checked with showq.

From ketancmaheshwari at gmail.com Thu Sep 1 16:16:09 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Thu, 1 Sep 2011 16:16:09 -0500
Subject: [Swift-devel] queuedsize > 0 but no job dequeued
In-Reply-To: <1314902627.21196.0.camel@blabla>
References: <1314663369.31525.0.camel@blabla> <1314667837.919.0.camel@blabla> <1314676559.1750.0.camel@blabla> <1314734754.6888.0.camel@blabla> <1314902627.21196.0.camel@blabla>
Message-ID:

Mihael,

That is likely. The walltime is 20 mins and most jobs as far as I know are less than 10 mins. However, there could be outliers. These are about 120k jobs.

Ketan

On Thu, Sep 1, 2011 at 1:43 PM, Mihael Hategan wrote:
> Is there any chance that some of your jobs run longer than their
> requested walltime?

--
Ketan

From jonmon at mcs.anl.gov Thu Sep 1 17:16:47 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 01 Sep 2011 17:16:47 -0500
Subject: [Swift-devel] Swift is hanging
Message-ID: <20110901221632.4FC1412765@zimbra.anl.gov>

I believe I tried the 0.93 src branch as well. Let me try it again.

----- Reply message -----
From: "Mihael Hategan"
Date: Thu, Sep 1, 2011 3:52 pm
Subject: Swift is hanging
To: "Jonathan Monette"
Cc: "swift-devel Devel"

Can you try a current svn checkout of the branch instead?

On Thu, 2011-09-01 at 13:07 -0500, Jonathan Monette wrote:
> Hello,
> I tried running the Swift 0.93 candidate. Swift started to hang
> after 27 tasks. The log is located
> at http://www.ci.uchicago.edu/~jonmon/logs/montage-3.log. This run
> was executed using coasters. The coaster log is
> at http://www.ci.uchicago.edu/~jonmon/logs/coasters.log.
>
> It shows that 8 tasks are active but there were no jobs active in PADS
> or Beagle when I checked with showq.

From hategan at mcs.anl.gov Thu Sep 1 17:19:00 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 01 Sep 2011 15:19:00 -0700
Subject: [Swift-devel] Swift is hanging
In-Reply-To: <20110901221632.4FC1412765@zimbra.anl.gov>
References: <20110901221632.4FC1412765@zimbra.anl.gov>
Message-ID: <1314915540.24230.0.camel@blabla>

Make sure you do an update. I committed some stuff just before sending the email below.

On Thu, 2011-09-01 at 17:16 -0500, Jonathan Monette wrote:
> I believe I tried the 0.93 src branch as well. Let me try it again.
>
> ----- Reply message -----
> From: "Mihael Hategan"
> Date: Thu, Sep 1, 2011 3:52 pm
> Subject: Swift is hanging
> To: "Jonathan Monette"
> Cc: "swift-devel Devel"
>
> Can you try a current svn checkout of the branch instead?
>
> On Thu, 2011-09-01 at 13:07 -0500, Jonathan Monette wrote:
> > Hello,
> > I tried running the Swift 0.93 candidate. Swift started to hang
> > after 27 tasks. The log is located
> > at http://www.ci.uchicago.edu/~jonmon/logs/montage-3.log. This run
> > was executed using coasters.
> > The coaster log is
> > at http://www.ci.uchicago.edu/~jonmon/logs/coasters.log.
> >
> > It shows that 8 tasks are active but there were no jobs active in PADS
> > or Beagle when I checked with showq.

From jonmon at mcs.anl.gov Thu Sep 1 17:30:18 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 01 Sep 2011 17:30:18 -0500
Subject: [Swift-devel] Swift is hanging
Message-ID: <20110901223003.03E2712768@zimbra.anl.gov>

Ohh. Ok. I tried it with the stable branch this morning, so I'll update and try again.

----- Reply message -----
From: "Mihael Hategan"
Date: Thu, Sep 1, 2011 5:19 pm
Subject: Swift is hanging
To: "Jonathan Monette"
Cc: "swift-devel Devel"

Make sure you do an update. I committed some stuff just before sending the email below.

On Thu, 2011-09-01 at 17:16 -0500, Jonathan Monette wrote:
> I believe I tried the 0.93 src branch as well. Let me try it again.
>
> ----- Reply message -----
> From: "Mihael Hategan"
> Date: Thu, Sep 1, 2011 3:52 pm
> Subject: Swift is hanging
> To: "Jonathan Monette"
> Cc: "swift-devel Devel"
>
> Can you try a current svn checkout of the branch instead?
>
> On Thu, 2011-09-01 at 13:07 -0500, Jonathan Monette wrote:
> > Hello,
> > I tried running the Swift 0.93 candidate. Swift started to hang
> > after 27 tasks. The log is located
> > at http://www.ci.uchicago.edu/~jonmon/logs/montage-3.log. This run
> > was executed using coasters. The coaster log is
> > at http://www.ci.uchicago.edu/~jonmon/logs/coasters.log.
> >
> > It shows that 8 tasks are active but there were no jobs active in PADS
> > or Beagle when I checked with showq.

From wilde at mcs.anl.gov Thu Sep 1 17:32:20 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 1 Sep 2011 17:32:20 -0500
Subject: [Swift-devel] 0.93 problem: Fwd: amwg-swift
Message-ID:

This issue reported by Sheri w the Parvis amwg script looks a lot like what Jon reported earlier today. Mihael, Justin, David, Jon: can you investigate? Jon, can you repro the problem and send in a jstack?

---------- Forwarded message ----------
From: Sheri Mickelson
Date: Thu, 1 Sep 2011 16:20:21 -0500
Subject: Fwd: amwg-swift
To: Michael Wilde
Cc: Andrew Mai

Hi Mike,

Sorry to bother you with this, but if you have time within the next 1/2 hour -45 minutes can you check on an amwg-swift run Andy is trying on mirage2.ucar.edu? He lists all the directories below. It's trying to run, but all swift is doing is ...

Progress: time: Thu, 01 Sep 2011 15:17:25 -0600 Initializing:450 Selecting site:12 Stage in:8

It's been repeating "Initializing:450 Selecting site:12 Stage in:8" for the last 40 minutes.

He's using swift 0.93. Is this a swift concurrency issue?

-Sheri

Begin forwarded message:

> From: Andrew Mai
> Date: September 1, 2011 4:09:05 PM CDT
> To: Sheri Mickelson
> Subject: Re: amwg-swift
>
> On 09/01/2011 02:59 PM, Sheri Mickelson wrote:
>
>> That's not normal.
>
> I have to go to a CSEG meeting. I will leave it running on mirage2.
> You can log in and watch it run, if you want to. Relevant directories:
>
> work dir : /glade/scratch/mai/swift-workdirectory/
> amwg_stats-20110901-1440-ujv4b5z8
> script dir : /glade/home/mai/AMWG_diag/swift
> code base : /fis/cgd/cms/amwg/amwg_diagnostics_swift
> output dir : /glade/scratch/mai/output-SO1
>
> Andy

--
Sent from my mobile device

From hategan at mcs.anl.gov Thu Sep 1 17:35:10 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 01 Sep 2011 15:35:10 -0700
Subject: [Swift-devel] 0.93 problem: Fwd: amwg-swift
In-Reply-To:
References:
Message-ID: <1314916510.24336.0.camel@blabla>

Can't tell much without a log.

On Thu, 2011-09-01 at 17:32 -0500, Michael Wilde wrote:
> This issue reported by Sheri w the Parvis amwg script looks a lot like
> what Jon reported earlier today. Mihael, Justin, David, Jon: can you
> investigate? Jon, can you repro the problem and send in a jstack?
> > ---------- Forwarded message ---------- > From: Sheri Mickelson > Date: Thu, 1 Sep 2011 16:20:21 -0500 > Subject: Fwd: amwg-swift > To: Michael Wilde > Cc: Andrew Mai > > Hi Mike, > > Sorry to bother you with this, but if you have time within the next > 1/2 hour -45 minutes can you check on a amwg-swift run Andy is trying > on mirage2.ucar.edu? He lists all the directories below. It's trying > to run, but all swift is doing is ... > > Progress: time: Thu, 01 Sep 2011 15:17:25 -0600 Initializing:450 > Selecting site:12 Stage in:8 > It's been repeating "Initializing:450 Selecting site:12 Stage in:8" > for the last 40 minutes. > > He's using swift 0.93. Is this a swift concurrency issue? > > -Sheri > > Begin forwarded message: > > > From: Andrew Mai > > Date: September 1, 2011 4:09:05 PM CDT > > To: Sheri Mickelson > > Subject: Re: amwg-swift > > > > On 09/01/2011 02:59 PM, Sheri Mickelson wrote: > > > >> That's not normal. > > > > I have to go to a CSEG meeting. I will leave it running on mirage2. > > You can log in and watch it run, if you want to. Relevant directories: > > > > work dir : /glade/scratch/mai/swift-workdirectory/ > > amwg_stats-20110901-1440-ujv4b5z8 > > script dir : /glade/home/mai/AMWG_diag/swift > > code base : /fis/cgd/cms/amwg/amwg_diagnostics_swift > > output dir : /glade/scratch/mai/output-SO1 > > > > > > Andy > > > > From jonmon at mcs.anl.gov Thu Sep 1 21:51:36 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 1 Sep 2011 21:51:36 -0500 Subject: [Swift-devel] Swift is hanging In-Reply-To: <20110901223003.03E2712768@zimbra.anl.gov> References: <20110901223003.03E2712768@zimbra.anl.gov> Message-ID: Ok. I tried again with the updated 0.93 source and the workflow completed. I am trying a couple more runs to verify that this isn't just a one time completion. One thing I did notice was that it seemed to run slower than normal. 
I am checking to make sure that this isn't data-set specific, but I thought I should let everyone know to keep an eye out in case their scripts are running slowly as well. The last time I noticed my scripts running slowly, the twice-each bug was found.

On Sep 1, 2011, at 5:30 PM, Jonathan Monette wrote:

> Ohh. Ok. I tried it with the stable branch this morning, so I'll update and try again.
>
> ----- Reply message -----
> From: "Mihael Hategan"
> Date: Thu, Sep 1, 2011 5:19 pm
> Subject: Swift is hanging
> To: "Jonathan Monette"
> Cc: "swift-devel Devel"
>
>
> Make sure you do an update. I committed some stuff just before sending
> the email below.
>
> On Thu, 2011-09-01 at 17:16 -0500, Jonathan Monette wrote:
> > I believe I tried the 0.93 src branch as well. Let me try it again.
> >
> > ----- Reply message -----
> > From: "Mihael Hategan"
> > Date: Thu, Sep 1, 2011 3:52 pm
> > Subject: Swift is hanging
> > To: "Jonathan Monette"
> > Cc: "swift-devel Devel"
> >
> >
> > Can you try a current svn checkout of the branch instead?
> >
> > On Thu, 2011-09-01 at 13:07 -0500, Jonathan Monette wrote:
> > > Hello,
> > > I tried running the Swift 0.93 candidate. Swift started to hang
> > > after 27 tasks. The log is located
> > > at http://www.ci.uchicago.edu/~jonmon/logs/montage-3.log. This run
> > > was executed using coasters. The coaster log is
> > > at http://www.ci.uchicago.edu/~jonmon/logs/coasters.log.
> > > > > > > > > Progress: time: Thu, 01 Sep 2011 12:54:01 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 12:54:31 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 12:55:01 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 12:55:31 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 12:56:01 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 12:56:31 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 12:57:01 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 12:57:31 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 12:58:01 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 12:58:31 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 12:59:01 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 12:59:31 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 13:00:01 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 13:00:31 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 13:01:01 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 13:01:31 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 13:02:01 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 13:02:31 -0500 Submitted:15 > > > Active:8 Finished successfully:27 > > > Progress: time: Thu, 01 Sep 2011 13:03:01 -0500 
> > > Submitted:15 Active:8 Finished successfully:27
> > > Progress: time: Thu, 01 Sep 2011 13:03:31 -0500 Submitted:15 Active:8 Finished successfully:27
> > > Progress: time: Thu, 01 Sep 2011 13:04:01 -0500 Submitted:15 Active:8 Finished successfully:27
> > > Progress: time: Thu, 01 Sep 2011 13:04:31 -0500 Submitted:15 Active:8 Finished successfully:27
> > > Progress: time: Thu, 01 Sep 2011 13:05:01 -0500 Submitted:15 Active:8 Finished successfully:27
> > > Progress: time: Thu, 01 Sep 2011 13:05:31 -0500 Submitted:15 Active:8 Finished successfully:27
> > > Progress: time: Thu, 01 Sep 2011 13:06:01 -0500 Submitted:15 Active:8 Finished successfully:27
> > > Progress: time: Thu, 01 Sep 2011 13:06:31 -0500 Submitted:15 Active:8 Finished successfully:27
> > > Progress: time: Thu, 01 Sep 2011 13:07:01 -0500 Submitted:15 Active:8 Finished successfully:27
> > >
> > > It shows that 8 tasks are active but there were no jobs active in PADS
> > > or Beagle when I checked with showq.
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From jonmon at mcs.anl.gov Thu Sep 1 21:52:48 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Thu, 1 Sep 2011 21:52:48 -0500
Subject: [Swift-devel] 0.93 problem: Fwd: amwg-swift
In-Reply-To: <1314916510.24336.0.camel@blabla>
References: <1314916510.24336.0.camel@blabla>
Message-ID:

I re-ran my scripts with the updated 0.93 source and they completed. Maybe this workflow was running on an out-of-date svn revision.

On Sep 1, 2011, at 5:35 PM, Mihael Hategan wrote:

> Can't tell much without a log.
> > On Thu, 2011-09-01 at 17:32 -0500, Michael Wilde wrote: >> This issue reported by Sheri w the Parvis amwg script looks a lot like >> what Jon reported earlier today. Mihael, Justin, David, Jon: can you >> investigate? Jon, can you repro the problem and send in a jstack? >> >> ---------- Forwarded message ---------- >> From: Sheri Mickelson >> Date: Thu, 1 Sep 2011 16:20:21 -0500 >> Subject: Fwd: amwg-swift >> To: Michael Wilde >> Cc: Andrew Mai >> >> Hi Mike, >> >> Sorry to bother you with this, but if you have time within the next >> 1/2 hour -45 minutes can you check on a amwg-swift run Andy is trying >> on mirage2.ucar.edu? He lists all the directories below. It's trying >> to run, but all swift is doing is ... >> >> Progress: time: Thu, 01 Sep 2011 15:17:25 -0600 Initializing:450 >> Selecting site:12 Stage in:8 >> It's been repeating "Initializing:450 Selecting site:12 Stage in:8" >> for the last 40 minutes. >> >> He's using swift 0.93. Is this a swift concurrency issue? >> >> -Sheri >> >> Begin forwarded message: >> >>> From: Andrew Mai >>> Date: September 1, 2011 4:09:05 PM CDT >>> To: Sheri Mickelson >>> Subject: Re: amwg-swift >>> >>> On 09/01/2011 02:59 PM, Sheri Mickelson wrote: >>> >>>> That's not normal. >>> >>> I have to go to a CSEG meeting. I will leave it running on mirage2. >>> You can log in and watch it run, if you want to. 
Relevant directories: >>> >>> work dir : /glade/scratch/mai/swift-workdirectory/ >>> amwg_stats-20110901-1440-ujv4b5z8 >>> script dir : /glade/home/mai/AMWG_diag/swift >>> code base : /fis/cgd/cms/amwg/amwg_diagnostics_swift >>> output dir : /glade/scratch/mai/output-SO1 >>> >>> >>> Andy >>> >> >> > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidk at ci.uchicago.edu Fri Sep 2 10:47:54 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Fri, 2 Sep 2011 10:47:54 -0500 (CDT) Subject: [Swift-devel] Swift 0.93 RC1 available In-Reply-To: <233145295.87296.1314831371473.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1693947267.89840.1314978474428.JavaMail.root@zimbra-mb2.anl.gov> Swift 0.93 RC2 is now available. It includes yesterday's coaster update. http://www.ci.uchicago.edu/swift/packages/swift-0.93RC2.tar.gz ----- Original Message ----- > From: "David Kelly" > To: "swift-devel Devel" > Sent: Wednesday, August 31, 2011 5:56:11 PM > Subject: [Swift-devel] Swift 0.93 RC1 available > Hello all, > > I just wanted to let you know that Swift 0.93 release candidate 1 is > now available at > http://www.ci.uchicago.edu/swift/packages/swift-0.93RC1.tar.gz > > Please download this and test it out a bit. If you notice any > problems, please send an email to the list or create a new bugzilla > ticket for it. Thanks! > > Regards, > David > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Fri Sep 2 11:39:03 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 2 Sep 2011 11:39:03 -0500 (CDT) Subject: [Swift-devel] Please add trunk site guide to wwwdev Message-ID: <338682414.269969.1314981543084.JavaMail.root@zimbra.anl.gov> Hi DAvid, Can you add the trunk site guide to wwwdev? 
This would give me a URL I can refer people to for the grid/ tools. Do you need to integrate the 0.93 site guide improvements into trunk? Mihael and Justin, can you describe how to do svn integration from the 0.93 branch to trunk, in general? Can we do this in part first (eg, just this one doc dir) and then do the rest post 0.93 release? - Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Fri Sep 2 15:35:25 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 02 Sep 2011 13:35:25 -0700 Subject: [Swift-devel] queuedsize > 0 but no job dequeued In-Reply-To: References: <1314663369.31525.0.camel@blabla> <1314667837.919.0.camel@blabla> <1314676559.1750.0.camel@blabla> <1314734754.6888.0.camel@blabla> <1314902627.21196.0.camel@blabla> Message-ID: <1314995725.31970.1.camel@blabla> I added some code to better deal with the situation (cog r3254). It now issues warnings in the log for jobs that exceed their walltime. On Thu, 2011-09-01 at 16:16 -0500, Ketan Maheshwari wrote: > Mihael, > > > That is likely. The walltime is 20 mins and most jobs as far as I know > are less than 10 mins. However, there could be outliers. These are > about 120k jobs. > > > Ketan > > On Thu, Sep 1, 2011 at 1:43 PM, Mihael Hategan > wrote: > Is there any chance that some of your jobs run longer than > their > requested walltime? > > > On Wed, 2011-08-31 at 09:04 -0500, Ketan Maheshwari wrote: > > Mihael, > > > > > > I did the run with the debug enabled on coasters. Please > find the logs > > etc, for this run here: > > > > > > http://www.ci.uchicago.edu/~ketan/run25.tgz > > > > > > > > > > Note that the run went well and ran upto 20k jobs without > issues. > > After that I did not get nodes so I stopped it and resumed > it this > > morning. It ran for about 1000+ jobs and crashed with the > same error > > message. 
> > > > > > > > > > Regards, > > Ketan > > > > On Tue, Aug 30, 2011 at 3:05 PM, Mihael Hategan > > > wrote: > > Any chance you can re-run this with debug enabled on > coasters > > > (log4j.logger.org.globus.cog.abstraction.coaster=DEBUG)? > > > > > > On Mon, 2011-08-29 at 20:55 -0700, Mihael Hategan > wrote: > > > My bad. The info is in the swift log. > > > > > > On Mon, 2011-08-29 at 20:59 -0500, Ketan > Maheshwari wrote: > > > > This is on Beagle. I am running local:pbs > from /lustre. > > > > > > > > On Mon, Aug 29, 2011 at 8:30 PM, Mihael Hategan > > > > > > wrote: > > > > On Mon, 2011-08-29 at 19:52 -0500, Ketan > > Maheshwari wrote: > > > > > Mihael, > > > > > > > > > > > > > > > This run was with automatic coasters. > I do not > > see any > > > > specific > > > > > coasters.log file written during this > run > > in .globus/coaster > > > > nor in > > > > > the run's work dir. > > > > > > > > > > > > It's on the remote site > in .globus/coasters. > > > > > > > > > > > > > > > > > > > Ketan > > > > > > > > > > On Mon, Aug 29, 2011 at 7:16 PM, > Mihael Hategan > > > > > > > > > wrote: > > > > > Can I have the coasters log > please? > > > > > > > > > > > > > > > On Sun, 2011-08-28 at 16:47 > -0500, Ketan > > Maheshwari > > > > wrote: > > > > > > Hello, > > > > > > > > > > > > > > > > > > I remember this error > happened in the > > past with > > > > Glen's and > > > > > Sheri's > > > > > > runs. I saw this today again > on Beagle > > with 0.93 > > > > while > > > > > running the > > > > > > DSSAT run. > > > > > > > > > > > > > > > > > > The run stops with the > following > > complete message: > > > > > > > > > > > > > > > > > > queuedsize > 0 but no job > dequeued. 
> > Queued: {} > > > > > > java.lang.Throwable > > > > > > at > > > > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269) > > > > > > at > > > > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539) > > > > > > at > > > > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110) > > > > > > queuedsize > 0 but no job > dequeued. > > Queued: {} > > > > > > java.lang.Throwable > > > > > > at > > > > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269) > > > > > > at > > > > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539) > > > > > > at > > > > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110) > > > > > > Progress: time: Sun, 28 Aug > 2011 > > 13:34:26 -0600 > > > > > Submitted:76 > > > > > > Active:23 Checking > status:1 > > Finished > > > > successfully:597 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The logs, properties and > sources for > > this run are: > > > > > > > > http://www.ci.uchicago.edu/~ketan/run23.tgz > > > > > > > > > > > > > > > > > > Regards, > > > > > > -- > > > > > > Ketan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Ketan > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > -- > > > > Ketan > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > -- > > Ketan > > > > > > > > > > > > > > -- > Ketan > > > From hategan at mcs.anl.gov Fri Sep 2 15:43:00 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 02 Sep 2011 13:43:00 -0700 Subject: [Swift-devel] Release notes for 0.92? In-Reply-To: References: Message-ID: <1314996180.32286.1.camel@blabla> On Thu, 2011-09-01 at 13:42 -0500, Thomas Uram wrote: > Maybe I've overlooked, but while I notice that earlier releases link > to release notes, 0.92 does not do this: > > > http://www.ci.uchicago.edu/swift/downloads/ > > > Does such exist somewhere? Specifically, I'm wondering if passive > coasters support is in 0.92 or if it's yet to come. Apparently the code is there. It is likely that it was the first iteration, so I'm not sure how good it is. I would suggest waiting for 0.93 which seems to be just around the corner and in which passive coasters have been tested considerably more. 
Mihael From iraicu at cs.iit.edu Mon Sep 5 12:41:50 2011 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Mon, 05 Sep 2011 12:41:50 -0500 Subject: [Swift-devel] ACM MTAGS 2011: deadline extension to September 26th Message-ID: <4E6509DE.7090204@cs.iit.edu> CALL FOR PAPERS 4th Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2011 http://datasys.cs.iit.edu/events/MTAGS11/index.html http://sc11.supercomputing.org/schedule/event_detail.php?evid=wksp122 ***DEADLINE EXTENSION -- September 26th, 2011* The 4th workshop on Many-Task Computing on Grids and Supercomputers (MTAGS11) will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of large-scale many-task computing (MTC) applications on large scale clusters, Grids, Supercomputers, and Cloud Computing infrastructure. MTC, the theme of the workshop encompasses loosely coupled applications, which are generally composed of many tasks (both independent and dependent tasks) to achieve some larger application goal. This workshop will cover challenges that can hamper efficiency and utilization in running applications on large-scale systems, such as local resource manager scalability and granularity, efficient utilization of raw hardware, parallel file system contention and scalability, data management, I/O management, reliability at scale, and application scalability. *General Chairs (mtags11-chairs at datasys.cs.iit.edu )* * Ioan Raicu, Illinois Institute of Technology & Argonne National Laboratory, USA * Ian Foster, University of Chicago & Argonne National Laboratory, USA * Yong Zhao, University of Electronic Science and Technology of China, China -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.iit.edu Mon Sep 5 12:50:31 2011 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Mon, 05 Sep 2011 12:50:31 -0500 Subject: [Swift-devel] ACM DataCloud-SC11 2011: deadline extension to September 26th Message-ID: <4E650BE7.1020700@cs.iit.edu> CALL FOR PAPERS The Second International Workshop on Data-Intensive Computing in the Clouds (DataCloud-SC11) 2011 http://datasys.cs.iit.edu/events/DataCloud-SC11/index.html http://sc11.supercomputing.org/schedule/event_detail.php?evid=wksp119 ***DEADLINE EXTENSION -- September 26th, 2011* The second international workshop on Data-intensive Computing in the Clouds (DataCloud-SC11) will provide the scientific community a dedicated forum for discussing new research, development, and deployment efforts in running data-intensive computing workloads on Cloud Computing infrastructures. The DataCloud-SC11 workshop will focus on the use of cloud-based technologies to meet the new data intensive scientific challenges that are not well served by the current supercomputers, grids or compute-intensive clouds. We believe the workshop will be an excellent place to help the community define the current state, determine future goals, and present architectures and services for future clouds supporting data intensive computing. 
*General Chairs (datacloud-sc11-chairs at datasys.cs.iit.edu)*

* Ioan Raicu, Illinois Institute of Technology & Argonne National Laboratory, USA
* Tevfik Kosar, University at Buffalo, USA
* Roger Barga, Microsoft Research, USA

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
=================================================================

From wilde at mcs.anl.gov Wed Sep 7 12:08:51 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 7 Sep 2011 12:08:51 -0500 (CDT)
Subject: [Swift-devel] Error in trunk version of chxml?
In-Reply-To: <52B7633F-2BD7-434E-AC34-93E05F8C7FA0@mcs.anl.gov>
Message-ID: <1705782374.281713.1315415331865.JavaMail.root@zimbra.anl.gov>

Is anyone seeing the error below?

----- Forwarded Message -----
...
Sent: Wednesday, September 7, 2011 10:43:42 AM
Subject: Re: Rev 2 of Swift Pagoda test script works

Hi Mike,

I tried the trunk branch (from this morning) and got this error:

Comparison against observations
  File "/fusion/gpfs/home/mickelso/soft/swift-trunk/cog/modules/swift/dist/swift-svn/bin/../bin/chxml", line 29
    except IOError as (errno, strerror):
                    ^
SyntaxError: invalid syntax
Could not process input files!

Is this a swift error? Or am I doing something wrong? Do you have a version on fusion that I can try using?
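A note on the traceback above: the `except IOError as (errno, strerror):` form is only accepted by Python 2.6 and 2.7. Interpreters older than 2.6 do not know the `as` keyword in an except clause, and Python 3 rejects the tuple target, so a SyntaxError at that line is what either of those would report. Below is a minimal sketch, not taken from chxml, of a handler that parses on both older and newer interpreters. The helper name `read_text` and the file path are hypothetical and used only for illustration.

```python
import sys


def read_text(path):
    """Return (contents, None) on success, or (None, message) on an I/O error.

    Hypothetical helper for illustration only; it is not taken from chxml.
    """
    try:
        handle = open(path)
        try:
            return handle.read(), None
        finally:
            handle.close()
    except IOError:
        # sys.exc_info()[1] is the exception currently being handled.
        # Fetching it this way avoids both "except IOError, e" (Python 2
        # only) and "except IOError as e" (Python 2.6 and later).
        err = sys.exc_info()[1]
        return None, "I/O error(%s): %s" % (err.errno, err.strerror)


# Assumed, non-existent path; the call reports the failure instead of raising.
contents, message = read_text("/nonexistent/dir/sites.xml")
if message is not None:
    print(message)
```

The nested try/finally is used instead of a `with` block for the same reason: `with` is not available without a `__future__` import before Python 2.6.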
From wozniak at mcs.anl.gov Wed Sep 7 12:47:19 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Wed, 7 Sep 2011 12:47:19 -0500 (Central Daylight Time) Subject: [Swift-devel] Error in trunk version of chxml? In-Reply-To: <1705782374.281713.1315415331865.JavaMail.root@zimbra.anl.gov> References: <1705782374.281713.1315415331865.JavaMail.root@zimbra.anl.gov> Message-ID: Yes, I made a change that requires Python 2.6, fixing now... On Wed, 7 Sep 2011, Michael Wilde wrote: > Is anyone seeing the error below? > > ----- Forwarded Message ----- > ... > Sent: Wednesday, September 7, 2011 10:43:42 AM > Subject: Re: Rev 2 of Swift Pagoda test script works > > Hi Mike, > > I tried the trunk branch (from this morning) and got this error: > > Comparison against observations > File "/fusion/gpfs/home/mickelso/soft/swift-trunk/cog/modules/swift/ > dist/swift-svn/bin/../bin/chxml", line 29 > except IOError as (errno, strerror): > ^ > SyntaxError: invalid syntax > Could not process input files! > > > Is this a swift error? Or am I doing something wrong? Do you have a > version on fusion that I can try using? > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Justin M Wozniak From wozniak at mcs.anl.gov Wed Sep 7 12:52:58 2011 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Wed, 7 Sep 2011 12:52:58 -0500 (Central Daylight Time) Subject: [Swift-devel] Error in trunk version of chxml? In-Reply-To: References: <1705782374.281713.1315415331865.JavaMail.root@zimbra.anl.gov> Message-ID: Ok, please try again. On Wed, 7 Sep 2011, Justin M Wozniak wrote: > > Yes, I made a change that requires Python 2.6, fixing now... > > On Wed, 7 Sep 2011, Michael Wilde wrote: > >> Is anyone seeing the error below? >> >> ----- Forwarded Message ----- >> ... 
>> Sent: Wednesday, September 7, 2011 10:43:42 AM
>> Subject: Re: Rev 2 of Swift Pagoda test script works
>>
>> Hi Mike,
>>
>> I tried the trunk branch (from this morning) and got this error:
>>
>> Comparison against observations
>> File "/fusion/gpfs/home/mickelso/soft/swift-trunk/cog/modules/swift/
>> dist/swift-svn/bin/../bin/chxml", line 29
>> except IOError as (errno, strerror):
>> ^
>> SyntaxError: invalid syntax
>> Could not process input files!
>>
>>
>> Is this a swift error? Or am I doing something wrong? Do you have a
>> version on fusion that I can try using?
>>
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>
>
>

--
Justin M Wozniak

From hategan at mcs.anl.gov Wed Sep 7 13:15:37 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 07 Sep 2011 11:15:37 -0700
Subject: [Swift-devel] 0.93
Message-ID: <1315419337.22542.0.camel@blabla>

Are there any remaining 0.93 bugs that did not go into bugzilla or get fixed?

From davidk at ci.uchicago.edu Wed Sep 7 13:45:40 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Wed, 7 Sep 2011 13:45:40 -0500 (CDT)
Subject: [Swift-devel] Please add trunk site guide to wwwdev
In-Reply-To: <338682414.269969.1314981543084.JavaMail.root@zimbra.anl.gov>
Message-ID: <262632650.95296.1315421140494.JavaMail.root@zimbra-mb2.anl.gov>

Mike,

I added a link on wwwdev to the trunk site guide. I did the updates to the docs with the svn merge command. These are the steps I used:

cd swift-trunk/cog/modules/swift/docs
svn merge https://svn.ci.uchicago.edu/svn/vdl2/branches/release-0.93/docs https://svn.ci.uchicago.edu/svn/vdl2/trunk/docs
svn commit

I believe the same steps could be used to merge the 0.93 fixes into trunk.
David ----- Original Message ----- > From: "Michael Wilde" > To: "David Kelly" > Cc: "swift-devel Devel" > Sent: Friday, September 2, 2011 11:39:03 AM > Subject: [Swift-devel] Please add trunk site guide to wwwdev > Hi DAvid, > > Can you add the trunk site guide to wwwdev? > > This would give me a URL I can refer people to for the grid/ tools. > > Do you need to integrate the 0.93 site guide improvements into trunk? > > Mihael and Justin, can you describe how to do svn integration from the > 0.93 branch to trunk, in general? Can we do this in part first (eg, > just this one doc dir) and then do the rest post 0.93 release? > > - Mike > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Wed Sep 7 15:03:13 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 7 Sep 2011 15:03:13 -0500 (CDT) Subject: [Swift-devel] 0.93 In-Reply-To: <1315419337.22542.0.camel@blabla> Message-ID: <1963806355.282470.1315425793623.JavaMail.root@zimbra.anl.gov> One item came up today: we forgot to put the needed lines for performance plotting into log4j.properties. Ketan filed this as bug 345 just now. I assigned to David pending discussion of whether this can/should go into 0.93. Lets see if anything else comes up and if another round of testing is warranted. - Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "Swift Devel" > Sent: Wednesday, September 7, 2011 1:15:37 PM > Subject: [Swift-devel] 0.93 > Are there any remaining 0.93 bugs that did not go into bugzilla or get > fixed? 
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Fri Sep 9 00:28:38 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 08 Sep 2011 22:28:38 -0700
Subject: [Swift-devel] Path.parse(path.toString())
Message-ID: <1315546118.7242.4.camel@blabla>

That bit of code was in AppStageouts and it caused a failure in a script.

The failure would manifest itself through some error message complaining about an invalid double-dotted path:
Application exception: org.griphyn.vdl.mapping.InvalidPathException: Invalid path (..index) for modelData:LoopModelData - Open

It's fixed in swift r5099.

But it does raise the question of why it wasn't being caught by tests, since it should be triggered in all instances when a simple struct is being staged out (which perhaps we're not testing).

From davidk at ci.uchicago.edu Fri Sep 9 01:10:56 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Fri, 9 Sep 2011 01:10:56 -0500 (CDT)
Subject: [Swift-devel] Path.parse(path.toString())
In-Reply-To: <1315546118.7242.4.camel@blabla>
Message-ID: <949271943.98225.1315548656927.JavaMail.root@zimbra-mb2.anl.gov>

It looks like there are a few scripts in the test suite which write structs and seem to be passing. Do you happen to have an example script I could add to the suite?

Here is the test suite output for the past few days (still a work in progress - needs better formatting):
http://www.ci.uchicago.edu/swift/wwwdev/tests/tests.pl

It looks like there are some failures in trunk related to declaring functions(?) but I think that's a separate issue. I'll file a ticket for those tomorrow.
David

----- Original Message -----
> From: "Mihael Hategan"
> To: "Swift Devel"
> Sent: Friday, September 9, 2011 12:28:38 AM
> Subject: [Swift-devel] Path.parse(path.toString())
> That bit of code was in AppStageouts and it caused a failure in a
> script.
>
> The failure would manifest itself through some error message
> complaining about an invalid double-dotted path:
> Application exception: org.griphyn.vdl.mapping.InvalidPathException:
> Invalid path (..index) for modelData:LoopModelData - Open
>
> It's fixed in swift r5099.
>
> But it does raise the question of why it wasn't being caught by tests,
> since it should be triggered in all instances when a simple struct is
> being staged out (which perhaps we're not testing).
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From hategan at mcs.anl.gov Fri Sep 9 01:31:44 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 08 Sep 2011 23:31:44 -0700
Subject: [Swift-devel] Path.parse(path.toString())
In-Reply-To: <949271943.98225.1315548656927.JavaMail.root@zimbra-mb2.anl.gov>
References: <949271943.98225.1315548656927.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1315549904.7449.1.camel@blabla>

On Fri, 2011-09-09 at 01:10 -0500, David Kelly wrote:
> It looks like there are a few scripts in the test suite which write structs and seem to be passing.

I suspect they are not staging structs out (just in).

> Do you happen to have an example script I could add to the suite?

type file;

type struct {
    file b;
    file c;
}

app (struct of) echo() {
    bash "echo bee >1 ; echo see >2" stdout=@filename(of.b) stderr=@filename(of.c);
}

struct s ;

s = echo();

(I'm not sure if this actually works if the bug is fixed, but I know it doesn't if it isn't. Anyway, you get the idea).
From wilde at mcs.anl.gov Fri Sep 9 08:41:31 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 9 Sep 2011 08:41:31 -0500 (CDT)
Subject: [Swift-devel] Next steps on SwiftR - re Fwd: [Swift-user] Do you have any resource for learning about SwiftR?
In-Reply-To: <20110907193001.AVI46299@mstore03.uchicago.edu>
Message-ID: <281051492.289080.1315575691955.JavaMail.root@zimbra.anl.gov>

Comments inline...

----- Original Message -----
> From: "Tim Armstrong"
> Subject: Re: Next steps on SwiftR - re Fwd: [Swift-user] Do you have any resource for learning about SwiftR?
>...
> I'm going to have a look at the CRAN docs. It sounds like it won't
> meet CRAN standards as it requires a pre-built version of swift to be
> bundled. I can look into what it would take to get it up to spec
> there.

It should be feasible to do an svn co of the Swift source and build it to create the Swift dist from within the SwiftR build. Would that pass CRAN requirements? The only way to find out, I think, is to test it and then submit to CRAN.

> I think we only have two serious sources of documentation: the wiki
> page and the R docs. My website was only really intended to show the
> benchmarks. I've removed the installation instructions from that page,
> so there is now no real documentation on there. I agree it would make
> sense to consolidate the documentation. Maybe the most sensible
> approach would be to have the R help docs be the canonical version,
> and generate a web version of those docs that would be released at the
> same time as new versions of the package.

I agree - that sounds best. Let's post it on the Swift web and also link there from the OpenMx wiki.

> Then all the authoritative documentation (including Quickstart, FAQ,
> Troubleshooting, etc) would be consolidated in that place.

Yes, that seems the best approach.

> Any
> supplemental documentation that doesn't really make sense to include
> in the CRAN package, e.g.
> about running SwiftR on particular clusters,
> could then remain on the SWFT wiki.

Let's move that to asciidoc within the Swift web.

> I agree with the additional sites. Beagle we should get going, and
> having good support for Teragrid and OSG makes a lot of sense. I'm
> still not sure what the best way to manage custom Swift configs would
> be. I think that leaving users to hand-write configs from scratch is
> too error-prone. Good documentation + a bunch of working examples is
> probably the most realistic solution, but it would be nice to allow a
> wider set of configurations to be used directly from R without editing
> swift configs. The one thing that springs to mind now is automatic
> resource provisioning: I have that working with PBS, but not with the
> full range of clusters

We should integrate this with Swift gensites, which David is developing to make Swift configuration easier.

- Mike

>
> - Tim
>
> ---- Original message ----
> >Date: Tue, 30 Aug 2011 10:34:39 -0500 (CDT)
> >From: Michael Wilde
> >Subject: Next steps on SwiftR - re Fwd: [Swift-user] Do you have any
> >resource for learning about SwiftR?
> >To: Tim Armstrong , David Kelly
> >
> >Cc: swift-devel Devel , Lorenzo Pesce
> >
> >
> >David, can you schedule some time to meet with Tim, learn and try
> >SwiftR, and work to publicize it on the Swift web?
> >
> >Tim, do you think its ready to submit to CRAN? Does it meet the CRAN
> >install criteria wrt building all included packages from source? (Ie,
> >Swift?)
> >
> >How should we handle the docs issue? I think we need to consolidate
> >docs from the at least 3 sources I know about and get them into both
> >the R help and echo them into an asciidoc page on the Swift web?
> >(SwiftR R help; Tim's page; SWFT pages(2); also, do we still have
> >docs on it on the OpenMx wiki? Last I looked the Swift help page
> >there now refers back to SWFT pages).
> >
> >Lastly, what do we need to do for additional SwiftR site support?
> >- config and test on beagle > >- integrate with more swift configs > >- test and support for TG/XSEDE and OSG > > > >Thanks, > > > >- Mike > > > > > >----- Forwarded Message ----- > >From: "Michael Wilde" > >To: "Lorenzo Pesce" > >Cc: "Tim Armstrong" > >Sent: Tuesday, August 30, 2011 10:22:06 AM > >Subject: Re: [Swift-user] Do you have any resource for learning about > >SwiftR? > > > >Lorenzo, > > > >The SwiftR documentation is currently at: > > > > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftR > > > >which also provides a quick start guide at: > > > > http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftRQuickstart > > > >Further examples and some performance measurements are at: > > > > http://people.cs.uchicago.edu/~tga/swiftR/ > > > >And more examples are available with ?SwiftR help once you load the > >package: > >> source("http://people.cs.uchicago.edu/~tga/swiftR/getSwift.R") > > > >I just built an R-2.13.1 release on Beagle with plain gcc, which I > >think *should* be runnable in parallel on worker nodes. (Not yet > >tested though). This R should be capable of running SwiftR. Im hoping > >that Tim cam verify this soon. We'll likely need an additional SwiftR > >server name and config for Beagle and other Cray systems. > > > >We'll try to consolidate the SwiftR documentation in a user guide on > >the Swift in the future. Tim, can you do a quick check of the > >documentation to make sure its still correct and that it points to > >the latest SwiftR package? > > > >Thanks, > > > >- Mike > > > > > >----- Original Message ----- > >> From: "Lorenzo Pesce" > >> To: swift-user at ci.uchicago.edu > >> Sent: Tuesday, August 30, 2011 9:22:51 AM > >> Subject: [Swift-user] Do you have any resource for learning about > >> SwiftR? > >> Hi - > >> > >> I want to run relatively small sized simulations (say at most 50 > >> cores > >> or so, probably mostly one or two) but many many times over. The > >> simulations will be coded in R. > >> > >> Thanks a lot! 
> >> > >> Lorenzo > >> _______________________________________________ > >> Swift-user mailing list > >> Swift-user at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > >-- > >Michael Wilde > >Computation Institute, University of Chicago > >Mathematics and Computer Science Division > >Argonne National Laboratory > > > >-- > >Michael Wilde > >Computation Institute, University of Chicago > >Mathematics and Computer Science Division > >Argonne National Laboratory > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From dsk at ci.uchicago.edu Fri Sep 9 08:44:42 2011 From: dsk at ci.uchicago.edu (Daniel S. Katz) Date: Fri, 9 Sep 2011 09:44:42 -0400 Subject: [Swift-devel] Next steps on SwiftR - re Fwd: [Swift-user] Do you have any resource for learning about SwiftR? In-Reply-To: <281051492.289080.1315575691955.JavaMail.root@zimbra.anl.gov> References: <281051492.289080.1315575691955.JavaMail.root@zimbra.anl.gov> Message-ID: <2F33B364-6B5C-4571-9620-9FAB8E9AE0C3@ci.uchicago.edu> Have you looked at the Technology Insertion Service (TIS) in XSEDE? This would be a way to push Swift (SwiftR?) forward there. Dan On Sep 9, 2011, at 9:41 AM, Michael Wilde wrote: > Comments inline... > > ----- Original Message ----- >> From: "Tim Armstrong" >> Subject: Re: Next steps on SwiftR - re Fwd: [Swift-user] Do you have any resource for learning about SwiftR? >> ... >> I'm going to have a look at the CRAN docs. It sounds like it won't >> meet CRAN standards as it requires a pre-built version of swift to be >> bundled. I can look into what it would take to get it up to spec >> there. > > It should be feasible to do an svn co of Swift source and build it to create the Swift dist from within the SwiftR build. Would that pass CRAN requirements? Only way to find out I think is to test it and then submit to CRAN. 
> >> I think we only have two serious sources of documentation: the wiki >> page and the R docs. My website was only really intended to show the >> benchmarks. I've removed the installation instructions from that page, >> so there is now no real documentation on there. I agree it would make >> sense to consolidate the documentation. Maybe the most sensible >> approach would be to have the R help docs be the canonical version, >> and generate a web version of those docs that would be released at the >> same time as new versions of the package. > > I agree - that sounds best. Lets post it on the Swift web and also link there from the OpenMx wiki. > >> Then all the authoritative documentation (including Quickstart, FAQ, >> Troubleshooting, etc) would be consolidated in that place. > > Yes, that seems the best approach. > >> Any >> supplemental documentation that doesn't really make sense to include >> in the CRAN package, e.g. about running SwiftR on particular clusters, >> could then remain on the SWFT wiki. > > Lets move that to asciidoc within the Swift web. > >> I agree with the additional sites. Beagle we should get going, and >> having good support for Teragrid and OSG makes a lot of sense. I'm >> still not sure what the best way to manage custom Swift configs would >> be. I think that leaving users to hand-write configs from scratch is >> too error-prone. Good documentation + a bunch of working examples is >> probably the most realistic solution, but it would be nice to allow a >> wider set of configurations to be used directly from R without editing >> swift configs. The one thing that springs to mind now is automatic >> resource provisioning: I have that working with PBS, but not with the >> full range of clusters > > We should integrate this with Swift gensites which David is developing to make swift configuration easier. 
> > - Mike > >> >> - Tim >> >> ---- Original message ---- >>> Date: Tue, 30 Aug 2011 10:34:39 -0500 (CDT) >>> From: Michael Wilde >>> Subject: Next steps on SwiftR - re Fwd: [Swift-user] Do you have any >>> resource for learning about SwiftR? >>> To: Tim Armstrong , David Kelly >>> >>> Cc: swift-devel Devel , Lorenzo Pesce >>> >>> >>> David, can you schedule some time to meet with Tim, learn and try >>> SwiftR, and work to publicize it on the Swift web? >>> >>> Tim, do you think its ready to submit to CRAN? Does it meet the CRAN >>> install criteria wrt building all included packages from source? (Ie, >>> Swift?) >>> >>> How should we handle the docs issue? I think we need to consolidate >>> docs from the at least 3 sources I know about and get them into both >>> the R help and echo them into an asciidoc page on the Swift web? >>> (SwiftR R help; Tim's page; SWFT pages(2); also, do we still have >>> docs on it on the OpenMx wiki? Last I looked the Swift help page >>> there now refers back to SWFT pages). >>> >>> Lastly, what do we need to do for additional SwiftR site support? >>> - config and test on beagle >>> - integrate with more swift configs >>> - test and support for TG/XSEDE and OSG >>> >>> Thanks, >>> >>> - Mike >>> >>> >>> ----- Forwarded Message ----- >>> From: "Michael Wilde" >>> To: "Lorenzo Pesce" >>> Cc: "Tim Armstrong" >>> Sent: Tuesday, August 30, 2011 10:22:06 AM >>> Subject: Re: [Swift-user] Do you have any resource for learning about >>> SwiftR? 
>>> >>> Lorenzo, >>> >>> The SwiftR documentation is currently at: >>> >>> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftR >>> >>> which also provides a quick start guide at: >>> >>> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftRQuickstart >>> >>> Further examples and some performance measurements are at: >>> >>> http://people.cs.uchicago.edu/~tga/swiftR/ >>> >>> And more examples are available with ?SwiftR help once you load the >>> package: >>>> source("http://people.cs.uchicago.edu/~tga/swiftR/getSwift.R") >>> >>> I just built an R-2.13.1 release on Beagle with plain gcc, which I >>> think *should* be runnable in parallel on worker nodes. (Not yet >>> tested though). This R should be capable of running SwiftR. Im hoping >>> that Tim cam verify this soon. We'll likely need an additional SwiftR >>> server name and config for Beagle and other Cray systems. >>> >>> We'll try to consolidate the SwiftR documentation in a user guide on >>> the Swift in the future. Tim, can you do a quick check of the >>> documentation to make sure its still correct and that it points to >>> the latest SwiftR package? >>> >>> Thanks, >>> >>> - Mike >>> >>> >>> ----- Original Message ----- >>>> From: "Lorenzo Pesce" >>>> To: swift-user at ci.uchicago.edu >>>> Sent: Tuesday, August 30, 2011 9:22:51 AM >>>> Subject: [Swift-user] Do you have any resource for learning about >>>> SwiftR? >>>> Hi - >>>> >>>> I want to run relatively small sized simulations (say at most 50 >>>> cores >>>> or so, probably mostly one or two) but many many times over. The >>>> simulations will be coded in R. >>>> >>>> Thanks a lot! 
>>>> >>>> Lorenzo >>>> _______________________________________________ >>>> Swift-user mailing list >>>> Swift-user at ci.uchicago.edu >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>> >>> -- >>> Michael Wilde >>> Computation Institute, University of Chicago >>> Mathematics and Computer Science Division >>> Argonne National Laboratory >>> >>> -- >>> Michael Wilde >>> Computation Institute, University of Chicago >>> Mathematics and Computer Science Division >>> Argonne National Laboratory >>> > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Daniel S. Katz University of Chicago (773) 834-7186 (voice) (773) 834-6818 (fax) d.katz at ieee.org or dsk at ci.uchicago.edu http://www.ci.uchicago.edu/~dsk/ From wilde at mcs.anl.gov Fri Sep 9 08:49:20 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 9 Sep 2011 08:49:20 -0500 (CDT) Subject: [Swift-devel] Next steps on SwiftR - re Fwd: [Swift-user] Do you have any resource for learning about SwiftR? In-Reply-To: <2F33B364-6B5C-4571-9620-9FAB8E9AE0C3@ci.uchicago.edu> Message-ID: <151266971.289102.1315576160125.JavaMail.root@zimbra.anl.gov> Thats an excellent idea Dan. David, can you explore this and report back on whats involved? (We should still work to get SwiftR into CRAN as that's the expected source of all R packages). Thanks, - Mike ----- Original Message ----- > From: "Daniel S. Katz" > To: "Michael Wilde" > Cc: "Tim Armstrong" , "Lorenzo Pesce" , "swift-devel Devel" > > Sent: Friday, September 9, 2011 8:44:42 AM > Subject: Re: [Swift-devel] Next steps on SwiftR - re Fwd: [Swift-user] Do you have any resource for learning about > SwiftR? > Have you looked at the Technology Insertion Service (TIS) in XSEDE? 
> This would be a way to push Swift (SwiftR?) forward there. > > Dan > > > On Sep 9, 2011, at 9:41 AM, Michael Wilde wrote: > > > Comments inline... > > > > ----- Original Message ----- > >> From: "Tim Armstrong" > >> Subject: Re: Next steps on SwiftR - re Fwd: [Swift-user] Do you > >> have any resource for learning about SwiftR? > >> ... > >> I'm going to have a look at the CRAN docs. It sounds like it won't > >> meet CRAN standards as it requires a pre-built version of swift to > >> be > >> bundled. I can look into what it would take to get it up to spec > >> there. > > > > It should be feasible to do an svn co of Swift source and build it > > to create the Swift dist from within the SwiftR build. Would that > > pass CRAN requirements? Only way to find out I think is to test it > > and then submit to CRAN. > > > >> I think we only have two serious sources of documentation: the wiki > >> page and the R docs. My website was only really intended to show > >> the > >> benchmarks. I've removed the installation instructions from that > >> page, > >> so there is now no real documentation on there. I agree it would > >> make > >> sense to consolidate the documentation. Maybe the most sensible > >> approach would be to have the R help docs be the canonical version, > >> and generate a web version of those docs that would be released at > >> the > >> same time as new versions of the package. > > > > I agree - that sounds best. Lets post it on the Swift web and also > > link there from the OpenMx wiki. > > > >> Then all the authoritative documentation (including Quickstart, > >> FAQ, > >> Troubleshooting, etc) would be consolidated in that place. > > > > Yes, that seems the best approach. > > > >> Any > >> supplemental documentation that doesn't really make sense to > >> include > >> in the CRAN package, e.g. about running SwiftR on particular > >> clusters, > >> could then remain on the SWFT wiki. > > > > Lets move that to asciidoc within the Swift web. 
> > > >> I agree with the additional sites. Beagle we should get going, and > >> having good support for Teragrid and OSG makes a lot of sense. I'm > >> still not sure what the best way to manage custom Swift configs > >> would > >> be. I think that leaving users to hand-write configs from scratch > >> is > >> too error-prone. Good documentation + a bunch of working examples > >> is > >> probably the most realistic solution, but it would be nice to allow > >> a > >> wider set of configurations to be used directly from R without > >> editing > >> swift configs. The one thing that springs to mind now is automatic > >> resource provisioning: I have that working with PBS, but not with > >> the > >> full range of clusters > > > > We should integrate this with Swift gensites which David is > > developing to make swift configuration easier. > > > > - Mike > > > >> > >> - Tim > >> > >> ---- Original message ---- > >>> Date: Tue, 30 Aug 2011 10:34:39 -0500 (CDT) > >>> From: Michael Wilde > >>> Subject: Next steps on SwiftR - re Fwd: [Swift-user] Do you have > >>> any > >>> resource for learning about SwiftR? > >>> To: Tim Armstrong , David Kelly > >>> > >>> Cc: swift-devel Devel , Lorenzo Pesce > >>> > >>> > >>> David, can you schedule some time to meet with Tim, learn and try > >>> SwiftR, and work to publicize it on the Swift web? > >>> > >>> Tim, do you think its ready to submit to CRAN? Does it meet the > >>> CRAN > >>> install criteria wrt building all included packages from source? > >>> (Ie, > >>> Swift?) > >>> > >>> How should we handle the docs issue? I think we need to > >>> consolidate > >>> docs from the at least 3 sources I know about and get them into > >>> both > >>> the R help and echo them into an asciidoc page on the Swift web? > >>> (SwiftR R help; Tim's page; SWFT pages(2); also, do we still have > >>> docs on it on the OpenMx wiki? Last I looked the Swift help page > >>> there now refers back to SWFT pages). 
> >>> > >>> Lastly, what do we need to do for additional SwiftR site support? > >>> - config and test on beagle > >>> - integrate with more swift configs > >>> - test and support for TG/XSEDE and OSG > >>> > >>> Thanks, > >>> > >>> - Mike > >>> > >>> > >>> ----- Forwarded Message ----- > >>> From: "Michael Wilde" > >>> To: "Lorenzo Pesce" > >>> Cc: "Tim Armstrong" > >>> Sent: Tuesday, August 30, 2011 10:22:06 AM > >>> Subject: Re: [Swift-user] Do you have any resource for learning > >>> about > >>> SwiftR? > >>> > >>> Lorenzo, > >>> > >>> The SwiftR documentation is currently at: > >>> > >>> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftR > >>> > >>> which also provides a quick start guide at: > >>> > >>> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftRQuickstart > >>> > >>> Further examples and some performance measurements are at: > >>> > >>> http://people.cs.uchicago.edu/~tga/swiftR/ > >>> > >>> And more examples are available with ?SwiftR help once you load > >>> the > >>> package: > >>>> source("http://people.cs.uchicago.edu/~tga/swiftR/getSwift.R") > >>> > >>> I just built an R-2.13.1 release on Beagle with plain gcc, which I > >>> think *should* be runnable in parallel on worker nodes. (Not yet > >>> tested though). This R should be capable of running SwiftR. Im > >>> hoping > >>> that Tim cam verify this soon. We'll likely need an additional > >>> SwiftR > >>> server name and config for Beagle and other Cray systems. > >>> > >>> We'll try to consolidate the SwiftR documentation in a user guide > >>> on > >>> the Swift in the future. Tim, can you do a quick check of the > >>> documentation to make sure its still correct and that it points to > >>> the latest SwiftR package? 
> >>> > >>> Thanks, > >>> > >>> - Mike > >>> > >>> > >>> ----- Original Message ----- > >>>> From: "Lorenzo Pesce" > >>>> To: swift-user at ci.uchicago.edu > >>>> Sent: Tuesday, August 30, 2011 9:22:51 AM > >>>> Subject: [Swift-user] Do you have any resource for learning about > >>>> SwiftR? > >>>> Hi - > >>>> > >>>> I want to run relatively small sized simulations (say at most 50 > >>>> cores > >>>> or so, probably mostly one or two) but many many times over. The > >>>> simulations will be coded in R. > >>>> > >>>> Thanks a lot! > >>>> > >>>> Lorenzo > >>>> _______________________________________________ > >>>> Swift-user mailing list > >>>> Swift-user at ci.uchicago.edu > >>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>> > >>> -- > >>> Michael Wilde > >>> Computation Institute, University of Chicago > >>> Mathematics and Computer Science Division > >>> Argonne National Laboratory > >>> > >>> -- > >>> Michael Wilde > >>> Computation Institute, University of Chicago > >>> Mathematics and Computer Science Division > >>> Argonne National Laboratory > >>> > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Daniel S. 
Katz
> University of Chicago
> (773) 834-7186 (voice)
> (773) 834-6818 (fax)
> d.katz at ieee.org or dsk at ci.uchicago.edu
> http://www.ci.uchicago.edu/~dsk/

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Fri Sep 9 09:33:23 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 9 Sep 2011 09:33:23 -0500 (CDT)
Subject: [Swift-devel] Path.parse(path.toString())
In-Reply-To: <1315549904.7449.1.camel@blabla>
Message-ID: <1133793033.289395.1315578803252.JavaMail.root@zimbra.anl.gov>

So far I'm not able to re-create this problem on the prior revision of 0.93. I'm currently using this variation of the script Mihael posted below:

type file;

type struct {
    file b;
    file c;
}

app (struct of) echo() {
    sh "-c" "echo bee >fileb ; echo see >filec";
}

struct s ;

s = echo();

I discovered what I think is a new bug in the process: if the shell command string contains the characters >&1 and >&2, Swift complains that it can't process the intermediate XML. Presumably those shell sequences are confusing the XML parser and need to be escaped.

- Mike

----- Original Message -----
> From: "Mihael Hategan"
> To: "David Kelly"
> Cc: "Swift Devel"
> Sent: Friday, September 9, 2011 1:31:44 AM
> Subject: Re: [Swift-devel] Path.parse(path.toString())
> On Fri, 2011-09-09 at 01:10 -0500, David Kelly wrote:
> > It looks like there are a few scripts in the test suite which write
> > structs and seem to be passing.
>
> I suspect they are not staging structs out (just in).
>
> > Do you happen to have an example script I could add to the suite?
>
> type file;
>
> type struct {
> file b;
> file c;
> }
>
> app (struct of) echo() {
> bash "echo bee >1 ; echo see >2" stdout=@filename(of.b)
> stderr=@filename(of.c);
> }
>
> struct s ;
>
> s = echo();
>
> (I'm not sure if this actually works if the bug is fixed, but I know
> it
> doesn't if it isn't. Anyway, you get the idea).
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From ketancmaheshwari at gmail.com Fri Sep 9 11:52:17 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Fri, 9 Sep 2011 11:52:17 -0500
Subject: [Swift-devel] Persistent coasters on OSG Swift not getting started cores
Message-ID:

Hi Mihael, All,

I am trying to run the DSSAT workflow, a simple one-process catsn-like loop.

The setup on OSG is based on persistent coasters, with the following elements:

1. A coaster service is started on the head node.
2. Workers are started on OSG sites. I am using 11 OSG sites.
3. The workers are submitted in the form of Condor jobs which connect back to the service running at the head node.
4. In the current instance that I am running, 500 workers are submitted to start, out of which 280 workers are in running state as of now.

My throttles (jobthrottle and foreach throttle) are set to run 500 tasks at a time.

However, I am seeing a see-saw pattern of active tasks whose peak is very low: the number of active tasks rises gradually from 0 to about 30, then decreases from 30 to 0, and then climbs back to 30.

The logs and sources are at: http://ci.uchicago.edu/~ketan/DSSAT-logs.tgz

This tarball contains the following:

DSSAT-logs/sites.grid-ps.xml
DSSAT-logs/tc-provider-staging
DSSAT-logs/cf.ps
DSSAT-logs/RunDSSAT.swift

Condor, swift logs

DSSAT-logs/condor.log
DSSAT-logs/swift.log

Service and worker's stdouts

DSSAT-logs/service-0.out
DSSAT-logs/swift-workers.out

Three runlogs since the run was resumed twice:

DSSAT-logs/RunDSSAT-20110909-1025-hjcelum9.log
DSSAT-logs/RunDSSAT-20110909-1030-jjefp0sb.log
DSSAT-logs/RunDSSAT-20110909-0918-0hk7ign5.log

Any insights would be helpful.

Regards,
--
Ketan

From ketancmaheshwari at gmail.com Fri Sep 9 12:03:07 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Fri, 9 Sep 2011 12:03:07 -0500
Subject: [Swift-devel] Persistent coasters on OSG Swift not getting started cores
In-Reply-To:
References:
Message-ID:

In addition, I see the following timeout messages about 2 hours into the workflow run:

org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
Command(2883, HEARTBEAT): handling reply timeout; sendReqTime=110909-115956.899, sendTime=110909-115956.899, now=110909-120156.901
Command(2883, HEARTBEAT)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
Command(2887, HEARTBEAT): handling reply timeout; sendReqTime=110909-120003.463, sendTime=110909-120003.463, now=110909-120203.468
Command(2887, HEARTBEAT)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)

Regards,
Ketan

On Fri, Sep 9, 2011 at 11:52 AM, Ketan Maheshwari <
ketancmaheshwari at gmail.com> wrote:

> Hi Mihael, All,
>
> I am trying to run
the DSSAT workflow, a simple one process catsn-like > loop. > > The setup on OSG is persisten coasters based with the following elements: > > 1. A coaster service is started on the head node > 2. Workers are started on OSG sites. I am using 11 OSG sites. > 3. The workers are submitted in the form of condor jobs which connect back > to the service running at the headnode. > 4. In the current instance that I am running, 500 workers are submitted to > start, out of which 280 workers are in running state as of now. > > My throttles: jobthrottle, foreach throttle are set to run 500 tasks at a > time. > > However, I am seeing a see-saw pattern of active tasks whose peak is very > low. What I am seeing is: the number of active tasks start rising gradually > from 0 to about 30 followed by a decrease from 30 to 0 and back to 30. > > The logs and sources are at : http://ci.uchicago.edu/~ketan/DSSAT-logs.tgz > > This tarball contains the following: > > DSSAT-logs/sites.grid-ps.xml > DSSAT-logs/tc-provider-staging > DSSAT-logs/cf.ps > DSSAT-logs/RunDSSAT.swift > > Condor, swift logs > > DSSAT-logs/condor.log > DSSAT-logs/swift.log > > Service and worker's stdouts > > DSSAT-logs/service-0.out > DSSAT-logs/swift-workers.out > > Three runlogs since the run was resumed twice: > > DSSAT-logs/RunDSSAT-20110909-1025-hjcelum9.log > DSSAT-logs/RunDSSAT-20110909-1030-jjefp0sb.log > DSSAT-logs/RunDSSAT-20110909-0918-0hk7ign5.log > > Any insights would be helpful. > > Regards, > -- > Ketan > > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Fri Sep 9 21:38:34 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 09 Sep 2011 19:38:34 -0700 Subject: [Swift-devel] Persistent coasters on OSG Swift not getting started cores In-Reply-To: References: Message-ID: <1315622314.4583.3.camel@blabla> There seem to be lots of errors in that log, but a lot of them have to do with workers failing for unknown reasons. 
This is no different than what you mentioned before. So we really need to troubleshoot that. So please enable worker logging and collect worker logs. On Fri, 2011-09-09 at 11:52 -0500, Ketan Maheshwari wrote: > Hi Mihael, All, > > > I am trying to run the DSSAT workflow, a simple one process catsn-like > loop. > > > The setup on OSG is persisten coasters based with the following > elements: > > > 1. A coaster service is started on the head node > 2. Workers are started on OSG sites. I am using 11 OSG sites. > 3. The workers are submitted in the form of condor jobs which connect > back to the service running at the headnode. > 4. In the current instance that I am running, 500 workers are > submitted to start, out of which 280 workers are in running state as > of now. > > > My throttles: jobthrottle, foreach throttle are set to run 500 tasks > at a time. > > > However, I am seeing a see-saw pattern of active tasks whose peak is > very low. What I am seeing is: the number of active tasks start rising > gradually from 0 to about 30 followed by a decrease from 30 to 0 and > back to 30. > > > The logs and sources are > at : http://ci.uchicago.edu/~ketan/DSSAT-logs.tgz > > > This tarball contains the following: > > > DSSAT-logs/sites.grid-ps.xml > DSSAT-logs/tc-provider-staging > DSSAT-logs/cf.ps > DSSAT-logs/RunDSSAT.swift > > > Condor, swift logs > > > DSSAT-logs/condor.log > DSSAT-logs/swift.log > > > Service and worker's stdouts > > > DSSAT-logs/service-0.out > DSSAT-logs/swift-workers.out > > > Three runlogs since the run was resumed twice: > > > DSSAT-logs/RunDSSAT-20110909-1025-hjcelum9.log > DSSAT-logs/RunDSSAT-20110909-1030-jjefp0sb.log > DSSAT-logs/RunDSSAT-20110909-0918-0hk7ign5.log > > > Any insights would be helpful. 
>
>
> Regards,
> --
> Ketan
>
>
>

From wilde at mcs.anl.gov Sat Sep 10 09:30:34 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 10 Sep 2011 09:30:34 -0500 (CDT)
Subject: [Swift-devel] Persistent coasters on OSG Swift not getting started cores
In-Reply-To: <1315622314.4583.3.camel@blabla>
Message-ID: <1331118662.292879.1315665034215.JavaMail.root@zimbra.anl.gov>

Mihael, I agree with your assessment.

Ketan, to enable worker logs: the run-worker.sh script tries to do this. You need to verify that it is correctly setting worker.pl to log. For efficiency it places the worker log on the worker's local tmp filesystem. The trick is getting the log file back via Condor-G. The current run-worker.sh script tails the worker log (and any other files that happen to get created in the log dir) to stdout for Condor to ship it back. You should adjust run-worker.sh to ship the log file back in its entirety (or just increase the tail to something much larger).

- Mike

----- Original Message -----
> From: "Mihael Hategan"
> To: "Ketan Maheshwari"
> Cc: "Swift Devel"
> Sent: Friday, September 9, 2011 9:38:34 PM
> Subject: Re: [Swift-devel] Persistent coasters on OSG Swift not getting started cores
> There seem to be lots of errors in that log, but a lot of them have to
> do with workers failing for unknown reasons.
>
> This is no different than what you mentioned before. So we really need
> to troubleshoot that. So please enable worker logging and collect
> worker
> logs.
>
> On Fri, 2011-09-09 at 11:52 -0500, Ketan Maheshwari wrote:
> > Hi Mihael, All,
> >
> >
> > I am trying to run the DSSAT workflow, a simple one process
> > catsn-like
> > loop.
> >
> >
> > The setup on OSG is persisten coasters based with the following
> > elements:
> >
> >
> > 1. A coaster service is started on the head node
> > 2. Workers are started on OSG sites. I am using 11 OSG sites.
> > 3.
The workers are submitted in the form of condor jobs which > > connect > > back to the service running at the headnode. > > 4. In the current instance that I am running, 500 workers are > > submitted to start, out of which 280 workers are in running state as > > of now. > > > > > > My throttles: jobthrottle, foreach throttle are set to run 500 tasks > > at a time. > > > > > > However, I am seeing a see-saw pattern of active tasks whose peak is > > very low. What I am seeing is: the number of active tasks start > > rising > > gradually from 0 to about 30 followed by a decrease from 30 to 0 and > > back to 30. > > > > > > The logs and sources are > > at : http://ci.uchicago.edu/~ketan/DSSAT-logs.tgz > > > > > > This tarball contains the following: > > > > > > DSSAT-logs/sites.grid-ps.xml > > DSSAT-logs/tc-provider-staging > > DSSAT-logs/cf.ps > > DSSAT-logs/RunDSSAT.swift > > > > > > Condor, swift logs > > > > > > DSSAT-logs/condor.log > > DSSAT-logs/swift.log > > > > > > Service and worker's stdouts > > > > > > DSSAT-logs/service-0.out > > DSSAT-logs/swift-workers.out > > > > > > Three runlogs since the run was resumed twice: > > > > > > DSSAT-logs/RunDSSAT-20110909-1025-hjcelum9.log > > DSSAT-logs/RunDSSAT-20110909-1030-jjefp0sb.log > > DSSAT-logs/RunDSSAT-20110909-0918-0hk7ign5.log > > > > > > Any insights would be helpful. 
> > > > > > Regards, > > -- > > Ketan > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jonmon at mcs.anl.gov Sun Sep 11 20:02:54 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Sun, 11 Sep 2011 20:02:54 -0500 Subject: [Swift-devel] Scheduler Question Message-ID: <8C354D09-FAEC-4E3F-A305-349FF64DA1D5@mcs.anl.gov> Does the scheduler take data affinity into account when deciding which site to run an app on? For instance, if app1 produces an output file on PADS and app2 can run on both Ranger and PADS, will the scheduler prefer to run app2 on PADS rather than Ranger? I thought that Mihael mentioned this at one point, but I just want to make sure. From ketancmaheshwari at gmail.com Mon Sep 12 11:58:44 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Mon, 12 Sep 2011 11:58:44 -0500 Subject: [Swift-devel] persistent coasters and data staging Message-ID: Hi Mihael, Owing to the issues we were facing with the OSG persistent coasters setup, I have been doing some experiments. Since the issues were apparently related to data staging, I conducted some experiments aiming to study the staging of data from a local client to the OSG sites. 
The description of my experiments is as follows: I performed about 40 runs from the Bridled client to OSG sites using a persistent coasters based setup. Each run (catsn) consisted of 100 tasks and a fixed data size per task. I increased the data size gradually from 0 MB (20 bytes) to 10 MB for successive runs. 15 runs were successful, and 11 were partially successful (up to 25% of tasks completed and the rest failed owing to data staging timeouts). 14 runs failed completely; they succeeded only after I lowered the throttle values (jobthrottle and foreach throttle, which bound the number of data stagings done in parallel). The data ranged from 0 to 10 MB per task. 12 runs were performed using the local /scratch directory as source of the data and destination of the results. 14 runs involved /gpfs/pads as source and destination of data and results respectively. The results are summarized here: https://docs.google.com/spreadsheet/ccc?key=0AmvYSwENKFY9dHpuM1NQQlZ5VS1idGs2M0hsbDFCa0E&hl=en_US Sheet 2 contains the table summarizing the parameters used in the runs. The green rows correspond to successful runs while the orange ones correspond to partial or failed runs. Sheet 3 shows a histogram of time versus data size for the successful runs only. The key trend that I observe from these runs is that data staging does not scale well as the data size increases relative to the throttle. At data sizes of 8 MB and 10 MB, I had to decrease the throttle to 10 in order to get successful runs. After some discussion with Mike, our conclusion from these runs was that the parallel data transfers are causing timeouts in worker.pl. We were undecided whether the timeout threshold is set too aggressively, how it is determined, and whether changing that value could resolve the issue. 
The runs, sources, swift and service logs, and log ids as shown in the last column are all available at: http://mcs.anl.gov/~ketan/catsn-condor.tgz The last 1000 lines of the worker logs are saved in the condor directory of the above tarball as condor/n.err and condor/n.out. However, I do not think worker errors are the issue here, since for each run I made sure a healthy number of workers were running. Regards, -- Ketan From hategan at mcs.anl.gov Mon Sep 12 13:56:19 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 12 Sep 2011 11:56:19 -0700 Subject: [Swift-devel] persistent coasters and data staging In-Reply-To: References: Message-ID: <1315853779.19354.5.camel@blabla> On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote: > After some discussion with Mike, Our conclusion from these runs was > that the parallel data transfers are causing timeouts from the > worker.pl, further, we were undecided if somehow the timeout threshold > is set too agressive plus how are they determined and whether a change > in that value could resolve the issue. Something like that. Worker.pl would use the time when a file transfer started to determine timeouts. This is undesirable. The purpose of timeouts is to determine whether the other side has stopped properly following the flow of things. It follows that any kind of activity should reset the timeout... timer. I updated the worker code to deal with the issue in a proper way. But now I need your help. This is perl code, and it needs testing. So can you re-run, first with some simple test that uses coaster staging (just to make sure I didn't mess something up), and then the version of your tests that was most likely to fail? 
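The distinction drawn in the message above, measuring a timeout from the start of a transfer versus from the last observed activity, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual worker.pl change: the function name `first_expiry`, the 60-second limit, and the 10-second chunk schedule are all hypothetical and chosen only to make the behavior visible.

```python
def first_expiry(chunk_times, limit, horizon, step=10, idle=True):
    """Return the first simulated time at which the timeout fires, or None.

    chunk_times: times (seconds) at which the peer delivers data (hypothetical).
    limit:       timeout threshold in seconds (hypothetical value).
    idle=False:  measure from the start of the transfer (the old behavior
                 described in the thread).
    idle=True:   measure from the last activity, so any activity resets the timer.
    """
    started = 0
    last_activity = 0
    chunks = set(chunk_times)
    for now in range(0, horizon + 1, step):
        if now in chunks:
            last_activity = now  # any activity resets the idle timer
        reference = last_activity if idle else started
        if now - reference > limit:
            return now
    return None


# A slow but healthy transfer: one chunk every 10 s for 100 s.
steady = range(0, 101, 10)
# A peer that really stops responding after 30 s.
stalled = [0, 10, 20, 30]

print(first_expiry(steady, 60, 100, idle=False))  # start-based timer
print(first_expiry(steady, 60, 100, idle=True))   # activity-based timer
print(first_expiry(stalled, 60, 200, idle=True))  # activity-based, stalled peer
```

With the start-based timer, the healthy transfer is killed once it runs longer than the limit, which matches the symptom of large parallel stagings failing. With the activity-based timer, the same transfer is left alone, while a peer that genuinely goes silent is still detected.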
From ketancmaheshwari at gmail.com Mon Sep 12 14:52:03 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Mon, 12 Sep 2011 14:52:03 -0500 Subject: [Swift-devel] persistent coasters and data staging In-Reply-To: <1315853779.19354.5.camel@blabla> References: <1315853779.19354.5.camel@blabla> Message-ID: Mihael, Have you changed only worker.pl? I ask because I am using a mix of trunk and 0.93. The worker.pl that I am using is from 0.93. If only that file has changed, I will use it from trunk. On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan wrote: > On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote: > > > After some discussion with Mike, Our conclusion from these runs was > > that the parallel data transfers are causing timeouts from the > > worker.pl, further, we were undecided if somehow the timeout threshold > > is set too agressive plus how are they determined and whether a change > > in that value could resolve the issue. > > Something like that. Worker.pl would use the time when a file transfer > started to determine timeouts. This is undesirable. The purpose of > timeouts is to determine whether the other side has stopped from > properly following the flow of things. It follows that any kind of > activity should reset the timeout... timer. > > I updated the worker code to deal with the issue in a proper way. But > now I need your help. This is perl code, and it needs testing. > > So can you re-run, first with some simple test that uses coaster staging > (just to make sure I didn't mess something up), and then the version of > your tests that was most likely to fail? > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Mon Sep 12 15:01:42 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 12 Sep 2011 13:01:42 -0700 Subject: [Swift-devel] persistent coasters and data staging In-Reply-To: References: <1315853779.19354.5.camel@blabla> Message-ID: <1315857702.20809.3.camel@blabla> On Mon, 2011-09-12 at 14:52 -0500, Ketan Maheshwari wrote: > Mihael, > > > Have you just changed the worker.pl. Asking since I am using a mix of > trunk and 0.93. > > > worker.pl that I am using is 0.93 .. If only this is changed. I will > use it from trunk. Sorry, but randomly mixing branches means I don't know what code you're actually using. Please don't mix things. Or if you do, please be very specific about what you're using from where. The worker.pl fix went into the 0.93 branch. > > > > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan > wrote: > On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote: > > > After some discussion with Mike, Our conclusion from these > runs was > > that the parallel data transfers are causing timeouts from > the > > worker.pl, further, we were undecided if somehow the timeout > threshold > > is set too agressive plus how are they determined and > whether a change > > in that value could resolve the issue. > > > Something like that. Worker.pl would use the time when a file > transfer > started to determine timeouts. This is undesirable. The > purpose of > timeouts is to determine whether the other side has stopped > from > properly following the flow of things. It follows that any > kind of > activity should reset the timeout... timer. > > I updated the worker code to deal with the issue in a proper > way. But > now I need your help. This is perl code, and it needs testing. > > So can you re-run, first with some simple test that uses > coaster staging > (just to make sure I didn't mess something up), and then the > version of > your tests that was most likely to fail? 
> > > > > > -- > Ketan > > > From ketancmaheshwari at gmail.com Mon Sep 12 15:56:29 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Mon, 12 Sep 2011 15:56:29 -0500 Subject: [Swift-devel] persistent coasters and data staging In-Reply-To: <1315853779.19354.5.camel@blabla> References: <1315853779.19354.5.camel@blabla> Message-ID: Mihael, I tried with the new worker.pl, running a 100 task 10MB per task run with throttle set at 100. However, it seems to have failed with the same symptoms of timeout error 521: Caused by: null Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521 Progress: time: Mon, 12 Sep 2011 15:45:31 -0500 Submitted:53 Active:1 Failed:46 Progress: time: Mon, 12 Sep 2011 15:45:34 -0500 Submitted:53 Active:1 Failed:46 Exception in cat: Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt] Host: grid Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk - - - Caused by: null Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521 Progress: time: Mon, 12 Sep 2011 15:45:45 -0500 Submitted:52 Active:1 Failed:47 Exception in cat: Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt] Host: grid Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk I had about 107 workers running at the time of these failures. I started seeing the failure messages after about 20 minutes into this run. The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz Regards, Ketan On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan wrote: > On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote: > > > After some discussion with Mike, Our conclusion from these runs was > > that the parallel data transfers are causing timeouts from the > > worker.pl, further, we were undecided if somehow the timeout threshold > > is set too agressive plus how are they determined and whether a change > > in that value could resolve the issue. 
> > Something like that. Worker.pl would use the time when a file transfer > started to determine timeouts. This is undesirable. The purpose of > timeouts is to determine whether the other side has stopped from > properly following the flow of things. It follows that any kind of > activity should reset the timeout... timer. > > I updated the worker code to deal with the issue in a proper way. But > now I need your help. This is perl code, and it needs testing. > > So can you re-run, first with some simple test that uses coaster staging > (just to make sure I didn't mess something up), and then the version of > your tests that was most likely to fail? > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Mon Sep 12 16:21:40 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 12 Sep 2011 14:21:40 -0700 Subject: [Swift-devel] persistent coasters and data staging In-Reply-To: References: <1315853779.19354.5.camel@blabla> Message-ID: <1315862500.21297.0.camel@blabla> Ok. I see the same problem in the service code. I'm working on a fix. On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote: > Mihael, > > > I tried with the new worker.pl, running a 100 task 10MB per task run > with throttle set at 100. 
> > > However, it seems to have failed with the same symptoms of timeout > error 521: > > > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.common.execution.JobException: Job > failed with an exit code of 521 > Progress: time: Mon, 12 Sep 2011 15:45:31 -0500 Submitted:53 > Active:1 Failed:46 > Progress: time: Mon, 12 Sep 2011 15:45:34 -0500 Submitted:53 > Active:1 Failed:46 > Exception in cat: > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt] > Host: grid > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk > - - - > > > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.common.execution.JobException: Job > failed with an exit code of 521 > Progress: time: Mon, 12 Sep 2011 15:45:45 -0500 Submitted:52 > Active:1 Failed:47 > Exception in cat: > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt] > Host: grid > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk > > > I had about 107 workers running at the time of these failures. > > > I started seeing the failure messages after about 20 minutes into this > run. > > > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz > > > Regards, > Ketan > > > > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan > wrote: > On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote: > > > After some discussion with Mike, Our conclusion from these > runs was > > that the parallel data transfers are causing timeouts from > the > > worker.pl, further, we were undecided if somehow the timeout > threshold > > is set too agressive plus how are they determined and > whether a change > > in that value could resolve the issue. > > > Something like that. Worker.pl would use the time when a file > transfer > started to determine timeouts. This is undesirable. The > purpose of > timeouts is to determine whether the other side has stopped > from > properly following the flow of things. It follows that any > kind of > activity should reset the timeout... timer. 
> > I updated the worker code to deal with the issue in a proper > way. But > now I need your help. This is perl code, and it needs testing. > > So can you re-run, first with some simple test that uses > coaster staging > (just to make sure I didn't mess something up), and then the > version of > your tests that was most likely to fail? > > > > > > -- > Ketan > > > From ketancmaheshwari at gmail.com Mon Sep 12 17:01:03 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Mon, 12 Sep 2011 17:01:03 -0500 Subject: [Swift-devel] Path.parse(path.toString()) In-Reply-To: <1133793033.289395.1315578803252.JavaMail.root@zimbra.anl.gov> References: <1315549904.7449.1.camel@blabla> <1133793033.289395.1315578803252.JavaMail.root@zimbra.anl.gov> Message-ID: Hello, Today I saw this error on SCEC workflow. could not find variable: _concurrent/var_str-d9e801ee-8aca-431f-99a0-8286c452c779-13-6 [7] Invalid path exception org.griphyn.vdl.mapping.InvalidPathException: Invalid path ([7]) for var_str:string[0] - Closed for path [7] org.griphyn.vdl.mapping.InvalidPathException: Invalid path ([7]) for var_str:string[0] - Closed at org.griphyn.vdl.mapping.AbstractDataNode.getField(AbstractDataNode.java:204) at org.griphyn.vdl.mapping.file.ArrayFileMapper.map(ArrayFileMapper.java:55) at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:281) at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:270) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:187) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:175) at org.griphyn.vdl.karajan.lib.Stagein.function(Stagein.java:54) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.futureModified(AbstractSequentialWithArguments.java:208) at org.griphyn.vdl.karajan.DSHandleFutureWrapper$1.run(DSHandleFutureWrapper.java:63) at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Execution failed: Progress: time: Mon, 12 Sep 2011 16:50:27 -0500 Submitted:1 Active:17 Finished successfully:10 Complete log for this run is: http://www.ci.uchicago.edu/~ketan/postproc-20110912-1642-epj1ebm1.log Swift is r5102. Regards, Ketan On Fri, Sep 9, 2011 at 9:33 AM, Michael Wilde wrote: > So far Im not able to re-create this problem on the prior revision of 0.93. > > Im currently using this variation of the script Mihael posted below: > > type file; > > type struct { > file b; > file c; > } > > app (struct of) echo() { > sh "-c" "echo bee >fileb ; echo see >filec"; > } > > struct s ; > > s = echo(); > > I discovered what I think is a new bug in the process: if the shell command > string contains the characters >&1 and >&2, Swift complains that it cant > process the intermediate xml. Presumably those shell sequences are getting > xml parsing confused and need to be escaped. > > - Mike > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "David Kelly" > > Cc: "Swift Devel" > > Sent: Friday, September 9, 2011 1:31:44 AM > > Subject: Re: [Swift-devel] Path.parse(path.toString()) > > On Fri, 2011-09-09 at 01:10 -0500, David Kelly wrote: > > > It looks like there are a few scripts in the test suite which write > > > structs and seem to be passing. > > > > I suspect they are not staging structs out (just in). > > > > > Do you happen to have an example script I could add to the suite? 
> > > > type file; > > > > type struct { > > file b; > > file c; > > } > > > > app (struct of) echo() { > > bash "echo bee >1 ; echo see >2" stdout=@filename(of.b) > > stderr=@filename(of.c); > > } > > > > struct s ; > > > > s = echo(); > > > > (I'm not sure if this actually works if the bug is fixed, but I know > > it > > doesn't if it isn't. Anyway, you get the idea). > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -- Ketan From wilde at mcs.anl.gov Mon Sep 12 18:04:29 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 12 Sep 2011 18:04:29 -0500 (CDT) Subject: [Swift-devel] Deadlocks running ParVis script under 0.93 In-Reply-To: <119569455.297769.1315867991526.JavaMail.root@zimbra.anl.gov> Message-ID: <554803890.297804.1315868669697.JavaMail.root@zimbra.anl.gov> Mihael, I'm getting Java-level deadlocks when running a ParVis script. The user is seeing this as well (in at least one instance). I'm running on Fusion in the directory /home/wilde/amwg/run01. The script is complex; it should run >325 app calls (I'm not yet sure how many - I'm guessing at least 100 more). The two runs (swift work dirs) that show the deadlocks are: fusion$ ls -lt */jstack.out -rw-r--r-- 1 wilde mcsz 80249 Sep 12 17:39 amwg_stats-20110912-1546-5aqzkvhe/jstack.out -rw-r--r-- 1 wilde mcsz 135539 Sep 11 10:38 amwg_stats-20110911-1033-fd1brig2/jstack.out fusion$ The log files are in the run01 directory. 
The Swift stdout progress logs are in the top of the respective swift work dirs. One of my runs, amwg_stats-20110912-1546-5aqzkvhe, (as well as one of the user's runs) hung after 323 app calls with this Java deadlock: Found one Java-level deadlock: ============================= "pool-1-thread-32": waiting to lock monitor 0x000000005ccf97b0 (object 0x00002aaab56f0e30, a org.griphyn.vdl.mapping.RootDataNode), which is held by "pool-1-thread-11" "pool-1-thread-11": waiting to lock monitor 0x000000005ce1cad8 (object 0x00002aaab5a50768, a org.griphyn.vdl.karajan.DSHandleFutureWrapper), which is held by "pool-1-thread-32" Java stack information for the threads listed above: =================================================== "pool-1-thread-32": at org.griphyn.vdl.karajan.lib.SwiftArg.unwrap(SwiftArg.java:52) - waiting to lock <0x00002aaab56f0e30> (a org.griphyn.vdl.mapping.RootDataNode) at org.griphyn.vdl.karajan.lib.SwiftArg$Vargs.asArray(SwiftArg.java:177) at org.griphyn.vdl.karajan.lib.swiftscript.Misc.swiftscript_strcat(Misc.java:82) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.globus.cog.karajan.workflow.nodes.functions.FunctionsCollection.function(FunctionsCollection.java:82) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) ... 
"pool-1-thread-11": at org.griphyn.vdl.karajan.DSHandleFutureWrapper.notifyListeners(DSHandleFutureWrapper.java:68) - waiting to lock <0x00002aaab5a50768> (a org.griphyn.vdl.karajan.DSHandleFutureWrapper) at org.griphyn.vdl.karajan.DSHandleFutureWrapper.handleClosed(DSHandleFutureWrapper.java:122) at org.griphyn.vdl.mapping.AbstractDataNode.notifyListeners(AbstractDataNode.java:605) at org.griphyn.vdl.mapping.AbstractDataNode.closeShallow(AbstractDataNode.java:408) - locked <0x00002aaab56f0e30> (a org.griphyn.vdl.mapping.RootDataNode) at org.griphyn.vdl.mapping.AbstractDataNode.setValue(AbstractDataNode.java:358) at org.griphyn.vdl.mapping.RootDataNode.setValue(RootDataNode.java:227) at org.griphyn.vdl.karajan.lib.SetFieldValue.deepCopy(SetFieldValue.java:90) at org.griphyn.vdl.karajan.lib.SetFieldValue.function(SetFieldValue.java:49) - locked <0x00002aaab56f0e30> (a org.griphyn.vdl.mapping.RootDataNode) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) --- The other script hung after only about 20 app calls with two deadlocks, one of which is: Found one Java-level deadlock: ============================= "pool-1-thread-32": waiting to lock monitor 0x000000005b7a1da8 (object 0x00002aaab46d7490, a org.griphyn.vdl.mapping.RootDataNode), which is held by "pool-1-thread-26" "pool-1-thread-26": waiting to lock monitor 0x000000005b5e6620 (object 0x00002aaab46d63d0, a org.griphyn.vdl.karajan.WrapperMap), which is held by "pool-1-thread-10" "pool-1-thread-10": waiting to lock monitor 0x000000005b4e47e0 (object 0x00002aaac2015f90, a org.griphyn.vdl.mapping.RootArrayDataNode), which is held by "pool-1-thread-15" "pool-1-thread-15": waiting to lock monitor 0x000000005b5e6620 (object 0x00002aaab46d63d0, a org.griphyn.vdl.karajan.WrapperMap), which is held by 
"pool-1-thread-10" --- -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Mon Sep 12 19:19:13 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 12 Sep 2011 17:19:13 -0700 Subject: [Swift-devel] persistent coasters and data staging In-Reply-To: References: <1315853779.19354.5.camel@blabla> Message-ID: <1315873153.2945.0.camel@blabla> Try now please (cog r3262). On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote: > Mihael, > > > I tried with the new worker.pl, running a 100 task 10MB per task run > with throttle set at 100. > > > However, it seems to have failed with the same symptoms of timeout > error 521: > > > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.common.execution.JobException: Job > failed with an exit code of 521 > Progress: time: Mon, 12 Sep 2011 15:45:31 -0500 Submitted:53 > Active:1 Failed:46 > Progress: time: Mon, 12 Sep 2011 15:45:34 -0500 Submitted:53 > Active:1 Failed:46 > Exception in cat: > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt] > Host: grid > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk > - - - > > > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.common.execution.JobException: Job > failed with an exit code of 521 > Progress: time: Mon, 12 Sep 2011 15:45:45 -0500 Submitted:52 > Active:1 Failed:47 > Exception in cat: > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt] > Host: grid > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk > > > I had about 107 workers running at the time of these failures. > > > I started seeing the failure messages after about 20 minutes into this > run. 
> > > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz > > > Regards, > Ketan > > > > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan > wrote: > On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote: > > > After some discussion with Mike, Our conclusion from these > runs was > > that the parallel data transfers are causing timeouts from > the > > worker.pl, further, we were undecided if somehow the timeout > threshold > > is set too agressive plus how are they determined and > whether a change > > in that value could resolve the issue. > > > Something like that. Worker.pl would use the time when a file > transfer > started to determine timeouts. This is undesirable. The > purpose of > timeouts is to determine whether the other side has stopped > from > properly following the flow of things. It follows that any > kind of > activity should reset the timeout... timer. > > I updated the worker code to deal with the issue in a proper > way. But > now I need your help. This is perl code, and it needs testing. > > So can you re-run, first with some simple test that uses > coaster staging > (just to make sure I didn't mess something up), and then the > version of > your tests that was most likely to fail? > > > > > > -- > Ketan > > > From hategan at mcs.anl.gov Mon Sep 12 19:29:15 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 12 Sep 2011 17:29:15 -0700 Subject: [Swift-devel] Deadlocks running ParVis script under 0.93 In-Reply-To: <554803890.297804.1315868669697.JavaMail.root@zimbra.anl.gov> References: <554803890.297804.1315868669697.JavaMail.root@zimbra.anl.gov> Message-ID: <1315873755.3024.3.camel@blabla> On Mon, 2011-09-12 at 18:04 -0500, Michael Wilde wrote: [...] > at org.globus.cog.karajan.workflow.nodes.functions.FunctionsCollection.function(FunctionsCollection.java:82) > at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > ... 
> "pool-1-thread-11": > at org.griphyn.vdl.karajan.DSHandleFutureWrapper.notifyListeners(DSHandleFutureWrapper.java:68) > - waiting to lock <0x00002aaab5a50768> (a org.griphyn.vdl.karajan.DSHandleFutureWrapper) > Could I please have the whole jstack output? The dots above "pool-1-thread-11" contain some important information. Mihael From hategan at mcs.anl.gov Mon Sep 12 19:53:33 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 12 Sep 2011 17:53:33 -0700 Subject: [Swift-devel] Deadlocks running ParVis script under 0.93 In-Reply-To: <554803890.297804.1315868669697.JavaMail.root@zimbra.anl.gov> References: <554803890.297804.1315868669697.JavaMail.root@zimbra.anl.gov> Message-ID: <1315875213.3366.0.camel@blabla> This is a rather old 0.93. One of the deadlocks was fixed a while ago. I'm investigating the other one. On Mon, 2011-09-12 at 18:04 -0500, Michael Wilde wrote: > Mihael, Im getting Java-level deadlocks in running a ParVis script. The user is seeming this as well (on at least one instance). > > Im running on Fusion in the directory /home/wilde/amwg/run01. > > The script is complex, it should run >325 app calls (Im not yet sure how many - Im guessing at least 100 more). > > The two runs (swift work dirs) that show the deadlocks are: > > fusion$ ls -lt */jstack.out > -rw-r--r-- 1 wilde mcsz 80249 Sep 12 17:39 amwg_stats-20110912-1546-5aqzkvhe/jstack.out > -rw-r--r-- 1 wilde mcsz 135539 Sep 11 10:38 amwg_stats-20110911-1033-fd1brig2/jstack.out > fusion$ > > The log files are in the run01 directory. The Swift stdout progress logs are in the top of the respective swift work dirs. 
> > One of my runs, amwg_stats-20110912-1546-5aqzkvhe, (as well as one of the user's runs) hung after 323 app calls with this Java deadlock: > > Found one Java-level deadlock: > ============================= > "pool-1-thread-32": > waiting to lock monitor 0x000000005ccf97b0 (object 0x00002aaab56f0e30, a org.griphyn.vdl.mapping.RootDataNode), > which is held by "pool-1-thread-11" > "pool-1-thread-11": > waiting to lock monitor 0x000000005ce1cad8 (object 0x00002aaab5a50768, a org.griphyn.vdl.karajan.DSHandleFutureWrapper), > which is held by "pool-1-thread-32" > > Java stack information for the threads listed above: > =================================================== > "pool-1-thread-32": > at org.griphyn.vdl.karajan.lib.SwiftArg.unwrap(SwiftArg.java:52) > - waiting to lock <0x00002aaab56f0e30> (a org.griphyn.vdl.mapping.RootDataNode) > at org.griphyn.vdl.karajan.lib.SwiftArg$Vargs.asArray(SwiftArg.java:177) > at org.griphyn.vdl.karajan.lib.swiftscript.Misc.swiftscript_strcat(Misc.java:82) > at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.globus.cog.karajan.workflow.nodes.functions.FunctionsCollection.function(FunctionsCollection.java:82) > at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:27) > ... 
> "pool-1-thread-11": > at org.griphyn.vdl.karajan.DSHandleFutureWrapper.notifyListeners(DSHandleFutureWrapper.java:68) > - waiting to lock <0x00002aaab5a50768> (a org.griphyn.vdl.karajan.DSHandleFutureWrapper) > at org.griphyn.vdl.karajan.DSHandleFutureWrapper.handleClosed(DSHandleFutureWrapper.java:122) > at org.griphyn.vdl.mapping.AbstractDataNode.notifyListeners(AbstractDataNode.java:605) > at org.griphyn.vdl.mapping.AbstractDataNode.closeShallow(AbstractDataNode.java:408) > - locked <0x00002aaab56f0e30> (a org.griphyn.vdl.mapping.RootDataNode) > at org.griphyn.vdl.mapping.AbstractDataNode.setValue(AbstractDataNode.java:358) > at org.griphyn.vdl.mapping.RootDataNode.setValue(RootDataNode.java:227) > at org.griphyn.vdl.karajan.lib.SetFieldValue.deepCopy(SetFieldValue.java:90) > at org.griphyn.vdl.karajan.lib.SetFieldValue.function(SetFieldValue.java:49) > - locked <0x00002aaab56f0e30> (a org.griphyn.vdl.mapping.RootDataNode) > at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:67) > at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) > at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) > > > --- > The other script hung after only about 20 app calls with two deadlocks, one of which is: > > Found one Java-level deadlock: > ============================= > "pool-1-thread-32": > waiting to lock monitor 0x000000005b7a1da8 (object 0x00002aaab46d7490, a org.griphyn.vdl.mapping.RootDataNode), > which is held by "pool-1-thread-26" > "pool-1-thread-26": > waiting to lock monitor 0x000000005b5e6620 (object 0x00002aaab46d63d0, a org.griphyn.vdl.karajan.WrapperMap), > which is held by "pool-1-thread-10" > "pool-1-thread-10": > waiting to lock monitor 0x000000005b4e47e0 (object 0x00002aaac2015f90, a org.griphyn.vdl.mapping.RootArrayDataNode), > which is held by "pool-1-thread-15" > "pool-1-thread-15": > waiting to lock monitor 0x000000005b5e6620 (object 
0x00002aaab46d63d0, a org.griphyn.vdl.karajan.WrapperMap), > which is held by "pool-1-thread-10" > > --- > > From hategan at mcs.anl.gov Mon Sep 12 19:55:32 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 12 Sep 2011 17:55:32 -0700 Subject: [Swift-devel] Deadlocks running ParVis script under 0.93 In-Reply-To: <1315875213.3366.0.camel@blabla> References: <554803890.297804.1315868669697.JavaMail.root@zimbra.anl.gov> <1315875213.3366.0.camel@blabla> Message-ID: <1315875332.3366.1.camel@blabla> On Mon, 2011-09-12 at 17:53 -0700, Mihael Hategan wrote: > This is a rather old 0.93. One of the deadlocks was fixed a while ago. And so was the other one. > I'm investigating the other one. > > On Mon, 2011-09-12 at 18:04 -0500, Michael Wilde wrote: > > Mihael, Im getting Java-level deadlocks in running a ParVis script. The user is seeming this as well (on at least one instance). > > > > Im running on Fusion in the directory /home/wilde/amwg/run01. > > > > The script is complex, it should run >325 app calls (Im not yet sure how many - Im guessing at least 100 more). > > > > The two runs (swift work dirs) that show the deadlocks are: > > > > fusion$ ls -lt */jstack.out > > -rw-r--r-- 1 wilde mcsz 80249 Sep 12 17:39 amwg_stats-20110912-1546-5aqzkvhe/jstack.out > > -rw-r--r-- 1 wilde mcsz 135539 Sep 11 10:38 amwg_stats-20110911-1033-fd1brig2/jstack.out > > fusion$ > > > > The log files are in the run01 directory. The Swift stdout progress logs are in the top of the respective swift work dirs. 
> _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Mon Sep 12 20:02:50 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 12 Sep 2011 18:02:50 -0700 Subject: [Swift-devel] Path.parse(path.toString()) In-Reply-To: References: <1315549904.7449.1.camel@blabla> <1133793033.289395.1315578803252.JavaMail.root@zimbra.anl.gov> Message-ID: <1315875770.3473.2.camel@blabla> It's unrelated to the Path.parse issue. var_str has no elements, but the code is assuming that it does. On Mon, 2011-09-12 at 17:01 -0500, Ketan Maheshwari wrote: > Hello, > > > Today I saw this error on SCEC workflow. > > > > > could not find variable: > _concurrent/var_str-d9e801ee-8aca-431f-99a0-8286c452c779-13-6 [7] > Invalid path exception org.griphyn.vdl.mapping.InvalidPathException: > Invalid path ([7]) for var_str:string[0] - Closed for path [7] > org.griphyn.vdl.mapping.InvalidPathException: Invalid path ([7]) for > var_str:string[0] - Closed > at > org.griphyn.vdl.mapping.AbstractDataNode.getField(AbstractDataNode.java:204) > at > org.griphyn.vdl.mapping.file.ArrayFileMapper.map(ArrayFileMapper.java:55) > at > org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:281) > at > org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:270) > at > org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:187) > at > org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:175) > at org.griphyn.vdl.karajan.lib.Stagein.function(Stagein.java:54) > at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62) > at >
org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.futureModified(AbstractSequentialWithArguments.java:208) > at org.griphyn.vdl.karajan.DSHandleFutureWrapper > $1.run(DSHandleFutureWrapper.java:63) > at java.util.concurrent.Executors > $RunnableAdapter.call(Executors.java:441) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at java.util.concurrent.ThreadPoolExecutor > $Worker.runTask(ThreadPoolExecutor.java:886) > at java.util.concurrent.ThreadPoolExecutor > $Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > Execution failed: > Progress: time: Mon, 12 Sep 2011 16:50:27 -0500 Submitted:1 > Active:17 Finished successfully:10 > > > > > Complete log for this run is: > http://www.ci.uchicago.edu/~ketan/postproc-20110912-1642-epj1ebm1.log > > > Swift is r5102. > > > Regards, > Ketan > > On Fri, Sep 9, 2011 at 9:33 AM, Michael Wilde > wrote: > So far Im not able to re-create this problem on the prior > revision of 0.93. > > Im currently using this variation of the script Mihael posted > below: > > type file; > > type struct { > file b; > file c; > } > > app (struct of) echo() { > > sh "-c" "echo bee >fileb ; echo see >filec"; > } > > struct s ; > > s = echo(); > > I discovered what I think is a new bug in the process: if the > shell command string contains the characters >&1 and >&2, > Swift complains that it cant process the intermediate xml. > Presumably those shell sequences are getting xml parsing > confused and need to be escaped. > > - Mike > > ----- Original Message ----- > > From: "Mihael Hategan" > > > > To: "David Kelly" > > Cc: "Swift Devel" > > Sent: Friday, September 9, 2011 1:31:44 AM > > Subject: Re: [Swift-devel] Path.parse(path.toString()) > > On Fri, 2011-09-09 at 01:10 -0500, David Kelly wrote: > > > It looks like there are a few scripts in the test suite > which write > > > structs and seem to be passing. 
> > > > I suspect they are not staging structs out (just in). > > > > > Do you happen to have an example script I could add to > the suite? > > > > type file; > > > > type struct { > > file b; > > file c; > > } > > > > app (struct of) echo() { > > bash "echo bee >1 ; echo see >2" stdout=@filename(of.b) > > stderr=@filename(of.c); > > } > > > > struct s ; > > > > s = echo(); > > > > (I'm not sure if this actually works if the bug is fixed, > but I know > > it > > doesn't if it isn't. Anyway, you get the idea). > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > -- > Ketan > > > From wilde at mcs.anl.gov Mon Sep 12 22:14:02 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 12 Sep 2011 22:14:02 -0500 (CDT) Subject: [Swift-devel] Deadlocks running ParVis script under 0.93 In-Reply-To: <1315875332.3366.1.camel@blabla> Message-ID: <1517106820.298057.1315883642903.JavaMail.root@zimbra.anl.gov> Yes - Im very sorry!!! I was careful to build a fresh 0.93 but obviously did my svn ups on the wrong tree. Am re-testing now. - Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "Michael Wilde" > Cc: "Swift Devel" > Sent: Monday, September 12, 2011 7:55:32 PM > Subject: Re: [Swift-devel] Deadlocks running ParVis script under 0.93 > On Mon, 2011-09-12 at 17:53 -0700, Mihael Hategan wrote: > > This is a rather old 0.93. One of the deadlocks was fixed a while > > ago. > > And so was the other one. > > > I'm investigating the other one. 
-- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From davidk at ci.uchicago.edu Wed Sep 14 09:19:11 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 14 Sep 2011 09:19:11 -0500 (CDT) Subject: [Swift-devel] Path.parse(path.toString()) In-Reply-To: <1133793033.289395.1315578803252.JavaMail.root@zimbra.anl.gov> Message-ID: <71654589.104201.1316009951734.JavaMail.root@zimbra-mb2.anl.gov> I've tried reproducing this bug in a few different ways with 0.93RC2, but have not been able to. I'm wondering if it is one of those bugs that is dependent upon a certain configuration setting? I have added a variation of this script to the test suite (langauge-behavior/datatypes/056-struct-stage-out.swift).
Given the recent commits in the last week or so - the struct staging out issue and coaster fixes, I'm thinking we should set the release to October 1st. This would give us the time to do another full round of testing. It would also give us a chance to put the finishing touches on the website. 0.94 would then have the full time allotted for development and testing before a November 1st release. David ----- Original Message ----- > From: "Michael Wilde" > To: "Mihael Hategan" > Cc: "Swift Devel" , "David Kelly" > Sent: Friday, September 9, 2011 9:33:23 AM > Subject: Re: [Swift-devel] Path.parse(path.toString()) > So far Im not able to re-create this problem on the prior revision of > 0.93. > > Im currently using this variation of the script Mihael posted below: > > type file; > > type struct { > file b; > file c; > } > > app (struct of) echo() { > sh "-c" "echo bee >fileb ; echo see >filec"; > } > > struct s ; > > s = echo(); > > I discovered what I think is a new bug in the process: if the shell > command string contains the characters >&1 and >&2, Swift complains > that it cant process the intermediate xml. Presumably those shell > sequences are getting xml parsing confused and need to be escaped. > > - Mike > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "David Kelly" > > Cc: "Swift Devel" > > Sent: Friday, September 9, 2011 1:31:44 AM > > Subject: Re: [Swift-devel] Path.parse(path.toString()) > > On Fri, 2011-09-09 at 01:10 -0500, David Kelly wrote: > > > It looks like there are a few scripts in the test suite which > > > write > > > structs and seem to be passing. > > > > I suspect they are not staging structs out (just in). > > > > > Do you happen to have an example script I could add to the suite? 
> > > > type file; > > > > type struct { > > file b; > > file c; > > } > > > > app (struct of) echo() { > > bash "echo bee >1 ; echo see >2" stdout=@filename(of.b) > > stderr=@filename(of.c); > > } > > > > struct s ; > > > > s = echo(); > > > > (I'm not sure if this actually works if the bug is fixed, but I know > > it > > doesn't if it isn't. Anyway, you get the idea). > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory From wilde at mcs.anl.gov Wed Sep 14 14:46:38 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 14 Sep 2011 14:46:38 -0500 (CDT) Subject: [Swift-devel] swift 0.93 deadlock In-Reply-To: Message-ID: <1230449294.6220.1316029598495.JavaMail.root@zimbra.anl.gov> I think I am seeing a similar deadlock on 0.93 in the ParVis script, and am trying to get a clean log and jstack to confirm. As far as I can tell, Papia is running the correct 0.93 code, but please verify. David will try to replicate this problem as well. - Mike ----- Original Message ----- > From: "Papia Rizwan" > To: "swift-devel Devel" , "Michael Wilde" , "Michael P. Shields" > > Sent: Wednesday, September 14, 2011 1:56:13 PM > Subject: swift 0.93 deadlock > Attached are the jstack output and the log file. > > -- > Papia Rizwan -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From turam at mcs.anl.gov Wed Sep 14 16:39:45 2011 From: turam at mcs.anl.gov (Thomas Uram) Date: Wed, 14 Sep 2011 16:39:45 -0500 Subject: [Swift-devel] User guide version Message-ID: When will the user guide be updated? 
We've been using import in 0.92 and discovered today that some useful features are incorrectly documented in the current user guide:

http://www.ci.uchicago.edu/swift/guides/userguide.php#imports

but correctly documented in the "trunk" user guide:

http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_imports

Specifically, the new guide correctly states that the import argument should be quoted, and that imported scripts can occur anywhere on the path defined by the SWIFT_LIB environment variable. That last portion is a very useful addition for us.

Thanks,
Tom

From jonmon at mcs.anl.gov Wed Sep 14 17:33:21 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Wed, 14 Sep 2011 17:33:21 -0500
Subject: [Swift-devel] Cache error in Swift
Message-ID: <8591A24C-8842-4FFA-B21A-94D66394CC5E@mcs.anl.gov>

Hello,

I just ran my SwiftMontage scripts again with the most recent build of the 0.93 source. I received this error:

Execution failed:
The cache already contains pads:montage-20110914-1717-nrothtg3/shared/proj_dir/proj_2mass-atlas-000713s-j0870197.fits.

I received this error after 3652 tasks. All the files are located in /home/jonmon/PADS/Swift/SwiftMontage/big/run.0013 on the CI machines.

I wanted to try replications. I have noticed in my scripts that when PADS is full of jobs, Swift doesn't try to resubmit to Beagle even though Beagle shows through showq that it has room for jobs. I thought replications would use both sites more efficiently. What I mean is that I thought replication would replicate jobs onto Beagle, since the jobs on PADS are taking so long just sitting in the queue.

Please educate me if this is not what I should be doing and if in fact there is no workaround to this problem.
From wilde at mcs.anl.gov Wed Sep 14 17:57:42 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 14 Sep 2011 17:57:42 -0500 (CDT)
Subject: [Swift-devel] Intentional change to behavior of high/lowOverAllocation parameter?
Message-ID: <562930901.7316.1316041062775.JavaMail.root@zimbra.anl.gov>

Mihael,

For quite a while now I have been telling people to set the coaster parameters lowOverallocation and highOverallocation to 100 to force coasters to set the time allocation of every block to maxTime. (This was based on your advice around the time we started running on Beagle, to enable the user to force a specific and constant PBS job walltime).

Cog rev 3225 seems to have introduced a change that insists that these two parameters have a value < 1.0:

+ checkLessThan("lowOverallocation", 1);
+ checkLessThan("highOverallocation", 1);

If not, the job fails with an exception:

org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: lowOverallocation must be < 1.0 (currently 100.0)

In addition, when I *do* set both parameters to 0.9999 (to try to achieve the same effect as they gave before) then I encounter the phenomenon of coasters starting with a ~10 minute walltime, but my app job doesn't seem to "fit" into the coaster block, and hence Swift just idles, making no progress. If I try to reduce my app maxwalltime, then the walltime of the PBS job is lowered, but the app job still doesn't "fit" and never gets run. I can send a log if the cause for this is not obvious to you.

Can you explain what the intended behavior is here, and whether you think the new check (circa Aug 7) introduced a bug?

Thanks,

- Mike

From wilde at mcs.anl.gov Wed Sep 14 18:10:32 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 14 Sep 2011 18:10:32 -0500 (CDT)
Subject: [Swift-devel] Intentional change to behavior of high/lowOverAllocation parameter?
In-Reply-To: <562930901.7316.1316041062775.JavaMail.root@zimbra.anl.gov>
Message-ID: <1776001935.7358.1316041832894.JavaMail.root@zimbra.anl.gov>

Mihael, two logs that show the "job never fits in a block" behavior with the overallocations both set to 0.999 are on the CI net at:

$ grep 0.999 *.log
catsn-20110914-1304-7mgf77k8.log: lowOverallocation = 0.999
catsn-20110914-1304-7mgf77k8.log: highOverallocation = 0.999
catsn-20110914-1326-6ma0ple4.log: lowOverallocation = 0.999
catsn-20110914-1326-6ma0ple4.log: highOverallocation = 0.999

The script is a single catsn job; the older log is with no maxwalltime specified; the more recent log is for a maxwalltime of 30 seconds specified in sites.xml:

3600 00:00:30 6 6 1 1 shared 5.99 10000 parvis 0.999 0.999 /home/wilde/amwg/run01

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Wed Sep 14 22:30:48 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 14 Sep 2011 20:30:48 -0700
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <1230449294.6220.1316029598495.JavaMail.root@zimbra.anl.gov>
References: <1230449294.6220.1316029598495.JavaMail.root@zimbra.anl.gov>
Message-ID: <1316057448.2368.0.camel@blabla>

Could you also forward the attachments please?

Mihael

On Wed, 2011-09-14 at 14:46 -0500, Michael Wilde wrote:
> I think I am seeing a similar deadlock on 0.93 in the ParVis script, and am trying to get a clean log and jstack to confirm.
> > As far as I can tell, Papia is running the correct 0.93 code, but please verify. > > David will try to replicate this problem as well. > > - Mike > > ----- Original Message ----- > > From: "Papia Rizwan" > > To: "swift-devel Devel" , "Michael Wilde" , "Michael P. Shields" > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM > > Subject: swift 0.93 deadlock > > Attached are the jstack output and the log file. > > > > -- > > Papia Rizwan > From hategan at mcs.anl.gov Wed Sep 14 22:39:42 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 14 Sep 2011 20:39:42 -0700 Subject: [Swift-devel] Intentional change to behavior of high/lowOverAllocation parameter? In-Reply-To: <562930901.7316.1316041062775.JavaMail.root@zimbra.anl.gov> References: <562930901.7316.1316041062775.JavaMail.root@zimbra.anl.gov> Message-ID: <1316057982.2767.1.camel@blabla> On Wed, 2011-09-14 at 17:57 -0500, Michael Wilde wrote: > Mihael, > > For quite a while now I have been telling people to set the coaster parameters lowOverAllocation and highOverAllocation to 100 to force coasters to set the time allocation of every block to maxTime. (This was based on your advice around the time we started running on Beagle, to enable the user to force a specific and constant PBS job walltime). > > Cog rev 3225 seems to have introduced a change that insists that these two parameters have a value < 1.0: > + checkLessThan("lowOverallocation", 1); > + checkLessThan("highOverallocation", 1); Whoops! I clearly meant GreaterThan. I'm not sure why my computer didn't commit what I wanted instead of what I wrote. r3273. 
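
For readers following the overallocation exchange, the kind of lower-bound check being discussed might look roughly like the sketch below. It is an illustration under assumptions, not the CoG source: the class name OverallocationCheck, the accepts helper, and the explicit value and bound arguments are hypothetical (the quoted diff only shows checkLessThan(name, bound)), IllegalArgumentException stands in for the TaskSubmissionException seen in the log, and the thread does not say whether the corrected bound is strict or inclusive, so a strict "greater than" is assumed here.

```java
// Hypothetical sketch of a lower-bound check for coaster settings.
// Names and signatures are assumptions for illustration; this is not
// the code committed in r3273.
final class OverallocationCheck {

    // Throws if 'value' is not strictly greater than 'bound'.
    // The message format is modeled on the exception quoted in the thread.
    static void checkGreaterThan(String name, double value, double bound) {
        if (!(value > bound)) {
            throw new IllegalArgumentException(
                name + " must be > " + bound + " (currently " + value + ")");
        }
    }

    // Convenience wrapper: true when the value passes the check.
    static boolean accepts(String name, double value, double bound) {
        try {
            checkGreaterThan(name, value, bound);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 100 is the value reported as recommended for a fixed block walltime.
        System.out.println("100.0 accepted: "
            + accepts("lowOverallocation", 100.0, 1.0));
        // 0.999 is the value that only the mistaken "less than" check allowed.
        System.out.println("0.999 accepted: "
            + accepts("highOverallocation", 0.999, 1.0));
    }
}
```

Under this assumed check, the setting of 100 described above passes, while the 0.999 values from the catsn logs are rejected, which is the reverse of the behavior reported with Cog rev 3225.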
From hategan at mcs.anl.gov Wed Sep 14 22:43:27 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 14 Sep 2011 20:43:27 -0700 Subject: [Swift-devel] Cache error in Swift In-Reply-To: <8591A24C-8842-4FFA-B21A-94D66394CC5E@mcs.anl.gov> References: <8591A24C-8842-4FFA-B21A-94D66394CC5E@mcs.anl.gov> Message-ID: <1316058207.2767.4.camel@blabla> I suppose it might be possible to have a race condition when using replication such that two jobs in the same replication group complete. But before I go and dig into that, can you double-check that you are not mapping two things to the same file? On Wed, 2011-09-14 at 17:33 -0500, Jonathan Monette wrote: > Hello, > I just ran my SwiftMontage scripts again with the most recent build of 0.93 source. I received this error > Execution failed: > The cache already contains pads:montage-20110914-1717-nrothtg3/shared/proj_dir/proj_2mass-atlas-000713s-j0870197.fits. > > I received this error after 3652 tasks. All the files are located in /home/jonmon/PADS/Swift/SwiftMontage/big/run.0013 on the CI machines. > > I wanted to try replications. I have noticed in my scripts that when PADS is full of jobs Swift doesn't try to resubmit to Beagle even though Beagle shows through showq that it has room for jobs. I thought replications would use both sites more efficiently. What I mean is I thought that replications would replicate a jobs onto Beagle since PADS is taking so long just sitting in the queue. > > Please educate me if this is not what I should be doing and if in fact there is no work around to this problem. From jonmon at mcs.anl.gov Wed Sep 14 23:03:11 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Wed, 14 Sep 2011 23:03:11 -0500 Subject: [Swift-devel] Cache error in Swift In-Reply-To: <1316058207.2767.4.camel@blabla> References: <8591A24C-8842-4FFA-B21A-94D66394CC5E@mcs.anl.gov> <1316058207.2767.4.camel@blabla> Message-ID: I am pretty sure I am not. 
All the input files are unique and the output files are all mapped using the regexp mapper and appending proj_ to the front. But I will check anyways. I seem to be having the error without replications as well. Let me try a couple more runs to see if I added something erroneously. On Sep 14, 2011, at 10:43 PM, Mihael Hategan wrote: > I suppose it might be possible to have a race condition when using > replication such that two jobs in the same replication group complete. > > But before I go and dig into that, can you double-check that you are not > mapping two things to the same file? > > On Wed, 2011-09-14 at 17:33 -0500, Jonathan Monette wrote: >> Hello, >> I just ran my SwiftMontage scripts again with the most recent build of 0.93 source. I received this error >> Execution failed: >> The cache already contains pads:montage-20110914-1717-nrothtg3/shared/proj_dir/proj_2mass-atlas-000713s-j0870197.fits. >> >> I received this error after 3652 tasks. All the files are located in /home/jonmon/PADS/Swift/SwiftMontage/big/run.0013 on the CI machines. >> >> I wanted to try replications. I have noticed in my scripts that when PADS is full of jobs Swift doesn't try to resubmit to Beagle even though Beagle shows through showq that it has room for jobs. I thought replications would use both sites more efficiently. What I mean is I thought that replications would replicate a jobs onto Beagle since PADS is taking so long just sitting in the queue. >> >> Please educate me if this is not what I should be doing and if in fact there is no work around to this problem. > > From davidk at ci.uchicago.edu Wed Sep 14 23:04:41 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 14 Sep 2011 23:04:41 -0500 (CDT) Subject: [Swift-devel] swift 0.93 deadlock In-Reply-To: <1316057448.2368.0.camel@blabla> Message-ID: <1031377656.105634.1316059481205.JavaMail.root@zimbra-mb2.anl.gov> I was able to reproduce the problem with persistent coasters on the MCS servers. 
The jstack output is at http://www.ci.uchicago.edu/~davidk/swat/jstack.log The full collection of logs are at http://www.ci.uchicago.edu/~davidk/swat. David ----- Original Message ----- > From: "Mihael Hategan" > To: "Michael Wilde" > Cc: "swift-devel Devel" , "Papia Rizwan" > Sent: Wednesday, September 14, 2011 10:30:48 PM > Subject: Re: [Swift-devel] swift 0.93 deadlock > Could you also forward the attachments please? > > Mihael > > On Wed, 2011-09-14 at 14:46 -0500, Michael Wilde wrote: > > I think I am seeing a similar deadlock on 0.93 in the ParVis script, > > and am trying to get a clean log and jstack to confirm. > > > > As far as I can tell, Papia is running the correct 0.93 code, but > > please verify. > > > > David will try to replicate this problem as well. > > > > - Mike > > > > ----- Original Message ----- > > > From: "Papia Rizwan" > > > To: "swift-devel Devel" , "Michael > > > Wilde" , "Michael P. Shields" > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM > > > Subject: swift 0.93 deadlock > > > Attached are the jstack output and the log file. > > > > > > -- > > > Papia Rizwan > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Thu Sep 15 05:54:11 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 15 Sep 2011 05:54:11 -0500 (CDT) Subject: [Swift-devel] swift 0.93 deadlock In-Reply-To: <1031377656.105634.1316059481205.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1832751611.7926.1316084051743.JavaMail.root@zimbra.anl.gov> David, which of the many Swift logs in that /swat dir does the jstack.log pertain to? How many of these runs deadlocked? And, did you verify that you (and Papia) are running on the latest rev of the 0.93 branch? 
- Mike ----- Original Message ----- > From: "David Kelly" > To: "Mihael Hategan" > Cc: "swift-devel Devel" , "Papia Rizwan" , "Michael Wilde" > > Sent: Wednesday, September 14, 2011 11:04:41 PM > Subject: Re: [Swift-devel] swift 0.93 deadlock > I was able to reproduce the problem with persistent coasters on the > MCS servers. > > The jstack output is at > http://www.ci.uchicago.edu/~davidk/swat/jstack.log > > The full collection of logs are at > http://www.ci.uchicago.edu/~davidk/swat. > > David > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "Michael Wilde" > > Cc: "swift-devel Devel" , "Papia > > Rizwan" > > Sent: Wednesday, September 14, 2011 10:30:48 PM > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > Could you also forward the attachments please? > > > > Mihael > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael Wilde wrote: > > > I think I am seeing a similar deadlock on 0.93 in the ParVis > > > script, > > > and am trying to get a clean log and jstack to confirm. > > > > > > As far as I can tell, Papia is running the correct 0.93 code, but > > > please verify. > > > > > > David will try to replicate this problem as well. > > > > > > - Mike > > > > > > ----- Original Message ----- > > > > From: "Papia Rizwan" > > > > To: "swift-devel Devel" , "Michael > > > > Wilde" , "Michael P. Shields" > > > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM > > > > Subject: swift 0.93 deadlock > > > > Attached are the jstack output and the log file. 
> > > > > > > > -- > > > > Papia Rizwan > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Thu Sep 15 05:57:10 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 15 Sep 2011 05:57:10 -0500 (CDT) Subject: [Swift-devel] Cache error in Swift In-Reply-To: Message-ID: <272818988.7928.1316084230206.JavaMail.root@zimbra.anl.gov> Jon, can you log your mappings? Or can you verify them from the Swift logs? - Mike ----- Original Message ----- > From: "Jonathan Monette" > To: "Mihael Hategan" > Cc: "swift-devel Devel" , "Michael Wilde" > Sent: Wednesday, September 14, 2011 11:03:11 PM > Subject: Re: Cache error in Swift > I am pretty sure I am not. All the input files are unique and the > output files are all mapped using the regexp mapper and appending > proj_ to the front. > > But I will check anyways. I seem to be having the error without > replications as well. Let me try a couple more runs to see if I added > something erroneously. > > On Sep 14, 2011, at 10:43 PM, Mihael Hategan wrote: > > > I suppose it might be possible to have a race condition when using > > replication such that two jobs in the same replication group > > complete. > > > > But before I go and dig into that, can you double-check that you are > > not > > mapping two things to the same file? > > > > On Wed, 2011-09-14 at 17:33 -0500, Jonathan Monette wrote: > >> Hello, > >> I just ran my SwiftMontage scripts again with the most recent > >> build of 0.93 source. I received this error > >> Execution failed: > >> The cache already contains > >> pads:montage-20110914-1717-nrothtg3/shared/proj_dir/proj_2mass-atlas-000713s-j0870197.fits. > >> > >> I received this error after 3652 tasks. 
All the files are located > >> in /home/jonmon/PADS/Swift/SwiftMontage/big/run.0013 on the CI > >> machines. > >> > >> I wanted to try replications. I have noticed in my scripts that > >> when PADS is full of jobs Swift doesn't try to resubmit to Beagle > >> even though Beagle shows through showq that it has room for jobs. I > >> thought replications would use both sites more efficiently. What I > >> mean is I thought that replications would replicate a jobs onto > >> Beagle since PADS is taking so long just sitting in the queue. > >> > >> Please educate me if this is not what I should be doing and if in > >> fact there is no work around to this problem. > > > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From davidk at ci.uchicago.edu Thu Sep 15 08:03:09 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Thu, 15 Sep 2011 08:03:09 -0500 (CDT) Subject: [Swift-devel] swift 0.93 deadlock In-Reply-To: <1832751611.7926.1316084051743.JavaMail.root@zimbra.anl.gov> Message-ID: <697720960.105844.1316091789476.JavaMail.root@zimbra-mb2.anl.gov> The jstack log corresponds to the most recent log file - http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. jstack does not report any deadlocks, but I thought it might be useful so I included it. Swift was not making any progress for about 5 hours before I sent the logs. I am running the latest 0.93 branch. I will try again today. David ----- Original Message ----- > From: "Michael Wilde" > To: "David Kelly" > Cc: "swift-devel Devel" , "Papia Rizwan" , "Mihael Hategan" > > Sent: Thursday, September 15, 2011 5:54:11 AM > Subject: Re: [Swift-devel] swift 0.93 deadlock > David, which of the many Swift logs in that /swat dir does the > jstack.log pertain to? How many of these runs deadlocked? > > And, did you verify that you (and Papia) are running on the latest rev > of the 0.93 branch? 
> > > > > > > > > > -- > > > > > Papia Rizwan > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory From davidk at ci.uchicago.edu Thu Sep 15 10:16:44 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Thu, 15 Sep 2011 10:16:44 -0500 (CDT) Subject: [Swift-devel] swift 0.93 deadlock In-Reply-To: <1832751611.7926.1316084051743.JavaMail.root@zimbra.anl.gov> Message-ID: <598316145.106212.1316099804134.JavaMail.root@zimbra-mb2.anl.gov> I did see a deadlock this morning. Here are the relevant logs http://www.ci.uchicago.edu/~davidk/swat2/jstack.log http://www.ci.uchicago.edu/~davidk/swat2/cce_ua-20110915-0948-2m33dgh3.log David ----- Original Message ----- > From: "Michael Wilde" > To: "David Kelly" > Cc: "swift-devel Devel" , "Papia Rizwan" , "Mihael Hategan" > > Sent: Thursday, September 15, 2011 5:54:11 AM > Subject: Re: [Swift-devel] swift 0.93 deadlock > David, which of the many Swift logs in that /swat dir does the > jstack.log pertain to? How many of these runs deadlocked? > > And, did you verify that you (and Papia) are running on the latest rev > of the 0.93 branch? > > - Mike > > ----- Original Message ----- > > From: "David Kelly" > > To: "Mihael Hategan" > > Cc: "swift-devel Devel" , "Papia > > Rizwan" , "Michael Wilde" > > > > Sent: Wednesday, September 14, 2011 11:04:41 PM > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > I was able to reproduce the problem with persistent coasters on the > > MCS servers. > > > > The jstack output is at > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log > > > > The full collection of logs are at > > http://www.ci.uchicago.edu/~davidk/swat. 
> > > > David > > > > ----- Original Message ----- > > > From: "Mihael Hategan" > > > To: "Michael Wilde" > > > Cc: "swift-devel Devel" , "Papia > > > Rizwan" > > > Sent: Wednesday, September 14, 2011 10:30:48 PM > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > Could you also forward the attachments please? > > > > > > Mihael > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael Wilde wrote: > > > > I think I am seeing a similar deadlock on 0.93 in the ParVis > > > > script, > > > > and am trying to get a clean log and jstack to confirm. > > > > > > > > As far as I can tell, Papia is running the correct 0.93 code, > > > > but > > > > please verify. > > > > > > > > David will try to replicate this problem as well. > > > > > > > > - Mike > > > > > > > > ----- Original Message ----- > > > > > From: "Papia Rizwan" > > > > > To: "swift-devel Devel" , > > > > > "Michael > > > > > Wilde" , "Michael P. Shields" > > > > > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM > > > > > Subject: swift 0.93 deadlock > > > > > Attached are the jstack output and the log file. > > > > > > > > > > -- > > > > > Papia Rizwan > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory From jonmon at mcs.anl.gov Thu Sep 15 10:54:53 2011 From: jonmon at mcs.anl.gov (Jonathan Monette) Date: Thu, 15 Sep 2011 10:54:53 -0500 Subject: [Swift-devel] Cache error in Swift In-Reply-To: <272818988.7928.1316084230206.JavaMail.root@zimbra.anl.gov> References: <272818988.7928.1316084230206.JavaMail.root@zimbra.anl.gov> Message-ID: <6988661B-E1F1-4534-92B0-E712257DE906@mcs.anl.gov> My mappings seem to be correct. It does not look like I am trying to map two things to the same file.
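The mapping check discussed in this thread can be sketched as follows. This is only an illustration, not how Swift itself resolves mappings: it assumes the output name is simply the input name with proj_ prepended, and mapped_name and find_collisions are hypothetical helpers. Every file name in the demo other than the one quoted in the error message is made up.

```python
from collections import defaultdict


def mapped_name(source_name, prefix="proj_"):
    """Hypothetical stand-in for the mapper: prepend a prefix to the file name."""
    return prefix + source_name


def find_collisions(source_names, prefix="proj_"):
    """Return {target: [sources]} for every target produced by more than one source."""
    targets = defaultdict(list)
    for name in source_names:
        targets[mapped_name(name, prefix)].append(name)
    return {target: sources for target, sources in targets.items() if len(sources) > 1}


if __name__ == "__main__":
    inputs = [
        "2mass-atlas-000713s-j0870197.fits",  # name taken from the error message
        "example-a.fits",                     # made-up name for the demo
        "2mass-atlas-000713s-j0870197.fits",  # deliberate duplicate for the demo
    ]
    for target, sources in find_collisions(inputs).items():
        print(target, "<-", sources)
```

An empty result only rules out duplicate targets in the list that was checked; it says nothing about a possible race between replicas of the same job.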
On Sep 15, 2011, at 5:57 AM, Michael Wilde wrote: > Jon, can you log your mappings? Or can you verify them from the Swift logs? > > - Mike > > ----- Original Message ----- >> From: "Jonathan Monette" >> To: "Mihael Hategan" >> Cc: "swift-devel Devel" , "Michael Wilde" >> Sent: Wednesday, September 14, 2011 11:03:11 PM >> Subject: Re: Cache error in Swift >> I am pretty sure I am not. All the input files are unique and the >> output files are all mapped using the regexp mapper and appending >> proj_ to the front. >> >> But I will check anyways. I seem to be having the error without >> replications as well. Let me try a couple more runs to see if I added >> something erroneously. >> >> On Sep 14, 2011, at 10:43 PM, Mihael Hategan wrote: >> >>> I suppose it might be possible to have a race condition when using >>> replication such that two jobs in the same replication group >>> complete. >>> >>> But before I go and dig into that, can you double-check that you are >>> not >>> mapping two things to the same file? >>> >>> On Wed, 2011-09-14 at 17:33 -0500, Jonathan Monette wrote: >>>> Hello, >>>> I just ran my SwiftMontage scripts again with the most recent >>>> build of 0.93 source. I received this error >>>> Execution failed: >>>> The cache already contains >>>> pads:montage-20110914-1717-nrothtg3/shared/proj_dir/proj_2mass-atlas-000713s-j0870197.fits. >>>> >>>> I received this error after 3652 tasks. All the files are located >>>> in /home/jonmon/PADS/Swift/SwiftMontage/big/run.0013 on the CI >>>> machines. >>>> >>>> I wanted to try replications. I have noticed in my scripts that >>>> when PADS is full of jobs Swift doesn't try to resubmit to Beagle >>>> even though Beagle shows through showq that it has room for jobs. I >>>> thought replications would use both sites more efficiently. What I >>>> mean is I thought that replications would replicate a jobs onto >>>> Beagle since PADS is taking so long just sitting in the queue. 
>>>> >>>> Please educate me if this is not what I should be doing and if in >>>> fact there is no work around to this problem. >>> >>> > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From wilde at mcs.anl.gov Thu Sep 15 11:46:59 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 15 Sep 2011 11:46:59 -0500 (CDT) Subject: [Swift-devel] swift 0.93 deadlock In-Reply-To: <697720960.105844.1316091789476.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <982016274.9555.1316105219728.JavaMail.root@zimbra.anl.gov> David, it sounds like more analysis is needed here. If the SWAT runs are not showing a deadlock (but your runs are) then likely we have two different problems here. Another case we saw in 0.93 with scripts failing to progress is due to the overAllocation parameter problem that Mihael fixed yesterday. The symptom there is that Swift starts a coaster with a time slot too small for the apps in the script, and no apps wind up running. I think that situation in general merits a separate ticket, and may have been discussed on swift-devel (but quite a while ago). Can you determine if indeed Papia's SWAT runs are hanging for a reason other than a Java deadlock? - Mike ----- Original Message ----- > From: "David Kelly" > To: "Michael Wilde" > Cc: "swift-devel Devel" , "Papia Rizwan" , "Mihael Hategan" > > Sent: Thursday, September 15, 2011 8:03:09 AM > Subject: Re: [Swift-devel] swift 0.93 deadlock > The jstack log corresponds to the most recent log file - > http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. > jstack does not report any deadlocks, but I thought it might be useful > so I included it. Swift was not making any progress for about 5 hours > before I sent the logs. I am running the latest 0.93 branch. I will > try again today. 
> > > > > > > > > > - Mike > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Papia Rizwan" > > > > > > To: "swift-devel Devel" , > > > > > > "Michael > > > > > > Wilde" , "Michael P. Shields" > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM > > > > > > Subject: swift 0.93 deadlock > > > > > > Attached are the jstack output and the log file. > > > > > > > > > > > > -- > > > > > > Papia Rizwan > > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From davidk at ci.uchicago.edu Thu Sep 15 12:29:03 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Thu, 15 Sep 2011 12:29:03 -0500 (CDT) Subject: [Swift-devel] swift 0.93 deadlock In-Reply-To: <982016274.9555.1316105219728.JavaMail.root@zimbra.anl.gov> Message-ID: <1147734726.106619.1316107743944.JavaMail.root@zimbra-mb2.anl.gov> I narrowed down the problem a bit. Last night I ran jstack on the wrong java process which is why it didn't report a deadlock. Papia and I are seeing the same issue. 
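A minimal sketch of how a saved thread dump could be checked before concluding that two hangs are the same problem. It assumes the dump was taken from the right JVM process and that the JVM's deadlock detector marks a lock cycle with a line containing "Java-level deadlock"; summarize_jstack is a hypothetical helper, and the frame names to look for are supplied by the caller.

```python
import sys


def summarize_jstack(dump_text, frames_of_interest):
    """Summarize a saved thread dump.

    Assumption: when the JVM finds a lock cycle, the dump contains a line
    with the text "Java-level deadlock"; a hang with another cause will
    not carry that marker.
    """
    return {
        "deadlock_reported": "Java-level deadlock" in dump_text,
        "frames_seen": [f for f in frames_of_interest if f in dump_text],
    }


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python check_jstack.py jstack.log some.Class.method ...
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        print(summarize_jstack(handle.read(), sys.argv[2:]))
```

Running it on two dumps with the same frame names gives a quick way to see whether both runs are stuck in the same place.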
My jstack: http://www.ci.uchicago.edu/~davidk/swat2/jstack.log Papia's jstack: http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log It happens in the same place: org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100) org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24) Filed as bug #559 David ----- Original Message ----- > From: "Michael Wilde" > To: "David Kelly" > Cc: "swift-devel Devel" , "Papia Rizwan" , "Mihael Hategan" > > Sent: Thursday, September 15, 2011 11:46:59 AM > Subject: Re: [Swift-devel] swift 0.93 deadlock > David, it sounds like more analysis is needed here. If the SWAT runs > are not showing a deadlock (but your runs are) then likely we have two > different problems here. > > Another case we saw in 0.93 with scripts failing to progress is due to > the overAllocation parameter problem that Mihael fixed yesterday. The > symptom there is that Swift starts a coaster with a time slot too > small for the apps in the script, and no apps wind up running. I think > that situation in general merits a separate ticket, and may have been > discussed on swift-devel (but quite a while ago). > > Can you determine if indeed Papia's SWAT runs are hanging for a reason > other than a Java deadlock? > > - Mike > > > ----- Original Message ----- > > From: "David Kelly" > > To: "Michael Wilde" > > Cc: "swift-devel Devel" , "Papia > > Rizwan" , "Mihael Hategan" > > > > Sent: Thursday, September 15, 2011 8:03:09 AM > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > The jstack log corresponds to the most recent log file - > > http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. > > jstack does not report any deadlocks, but I thought it might be > > useful > > so I included it. Swift was not making any progress for about 5 > > hours > > before I sent the logs. I am running the latest 0.93 branch. I will > > try again today. 
> > > > > > > > > > > > - Mike > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "Papia Rizwan" > > > > > > > To: "swift-devel Devel" , > > > > > > > "Michael > > > > > > > Wilde" , "Michael P. Shields" > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM > > > > > > > Subject: swift 0.93 deadlock > > > > > > > Attached are the jstack output and the log file. > > > > > > > > > > > > > > -- > > > > > > > Papia Rizwan > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory From wilde at mcs.anl.gov Thu Sep 15 12:37:13 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 15 Sep 2011 12:37:13 -0500 (CDT) Subject: [Swift-devel] swift 0.93 deadlock In-Reply-To: <1147734726.106619.1316107743944.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1034010818.9849.1316108233352.JavaMail.root@zimbra.anl.gov> Excellent, thanks - that's good. I also just verified that Papia is not using the overAllocation tags in the sites file, so this problem is clearly a Java deadlock and has nothing to do with the scheduling problem that the (now fixed) overAllocation issue was causing. My understanding is that this SWAT script is failing under trunk because of the recent token case handling issue (I think the camel-case one). Can you work with Papia to see whether that issue is now fixed, or whether her script can be changed to avoid it, so that you can both test the SWAT script with trunk and see if the deadlock still occurs?
Thanks, - Mike ----- Original Message ----- > From: "David Kelly" > To: "Michael Wilde" > Cc: "swift-devel Devel" , "Papia Rizwan" , "Mihael Hategan" > > Sent: Thursday, September 15, 2011 12:29:03 PM > Subject: Re: [Swift-devel] swift 0.93 deadlock > I narrowed down the problem a bit. Last night I ran jstack on the > wrong java process which is why it didn't report a deadlock. > > Papia and I are seeing the same issue. > > My jstack: http://www.ci.uchicago.edu/~davidk/swat2/jstack.log > Papia's jstack: > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log > > It happens in the same place: > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100) > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24) > > Filed as bug #559 > > David > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "David Kelly" > > Cc: "swift-devel Devel" , "Papia > > Rizwan" , "Mihael Hategan" > > > > Sent: Thursday, September 15, 2011 11:46:59 AM > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > David, it sounds like more analysis is needed here. If the SWAT runs > > are not showing a deadlock (but your runs are) then likely we have > > two > > different problems here. > > > > Another case we saw in 0.93 with scripts failing to progress is due > > to > > the overAllocation parameter problem that Mihael fixed yesterday. > > The > > symptom there is that Swift starts a coaster with a time slot too > > small for the apps in the script, and no apps wind up running. I > > think > > that situation in general merits a separate ticket, and may have > > been > > discussed on swift-devel (but quite a while ago). > > > > Can you determine if indeed Papia's SWAT runs are hanging for a > > reason > > other than a Java deadlock?
> > > > > > > > > > > > > > > > -- > > > > > > > > Papia Rizwan > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > -- > > > > Michael Wilde > > > > Computation Institute, University of Chicago > > > > Mathematics and Computer Science Division > > > > Argonne National Laboratory > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From davidk at ci.uchicago.edu Thu Sep 15 14:15:36 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Thu, 15 Sep 2011 14:15:36 -0500 (CDT) Subject: [Swift-devel] swift 0.93 deadlock In-Reply-To: <1034010818.9849.1316108233352.JavaMail.root@zimbra.anl.gov> Message-ID: <40697929.106847.1316114136267.JavaMail.root@zimbra-mb2.anl.gov> I got past the compilation errors by renaming all the functions with capitalization, but ran into an issue with coaster-service. Last week I noticed coaster-service was missing options for dynamic ports. I found today that it is also missing -passive. I'll try to track down where this changed and restore the previous version. David ----- Original Message ----- > From: "Michael Wilde" > To: "David Kelly" > Cc: "swift-devel Devel" , "Papia Rizwan" , "Mihael Hategan" > > Sent: Thursday, September 15, 2011 12:37:13 PM > Subject: Re: [Swift-devel] swift 0.93 deadlock > Excellent, thanks - that's good. I also just verified that Papia is not > using the overAllocation tags in the sites file, so this problem is > clearly a Java deadlock and has nothing to do with the scheduling > problem that the (now fixed) overAllocation issue was causing.
From wilde at mcs.anl.gov Thu Sep 15 14:18:17 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 15 Sep 2011 14:18:17 -0500 (CDT)
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <40697929.106847.1316114136267.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <2137746985.10390.1316114297303.JavaMail.root@zimbra.anl.gov>

Can you make SWAT run under trunk? Papia is testing using standard auto coasters and doesn't need any of the missing coaster-service options.

- Mike
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From davidk at ci.uchicago.edu Thu Sep 15 14:39:47 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Thu, 15 Sep 2011 14:39:47 -0500 (CDT)
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <2137746985.10390.1316114297303.JavaMail.root@zimbra.anl.gov>
Message-ID: <548801091.106912.1316115587455.JavaMail.root@zimbra-mb2.anl.gov>

The sites.xml in /homes/papia/SwiftSCE2 seems to be using passive persistent coasters. Is there a way to use automatic coasters on the MCS workstations? I'll try copying this over to PADS and running there to see if I can reproduce it.

David
From davidk at ci.uchicago.edu Thu Sep 15 16:34:02 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Thu, 15 Sep 2011 16:34:02 -0500 (CDT)
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <548801091.106912.1316115587455.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <736589018.107162.1316122442091.JavaMail.root@zimbra-mb2.anl.gov>

I was able to get it running on PADS with trunk. I ran into the same issue.

http://www.ci.uchicago.edu/~davidk/swat3/jstack.log
http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log

David
Argonne National Laboratory > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Thu Sep 15 17:18:25 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 15 Sep 2011 15:18:25 -0700 Subject: [Swift-devel] swift 0.93 deadlock In-Reply-To: <736589018.107162.1316122442091.JavaMail.root@zimbra-mb2.anl.gov> References: <736589018.107162.1316122442091.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1316125105.4088.0.camel@blabla> Right. Seems to be in a spot that didn't change much lately. I'm a bit busy until Saturday, so I'll try to fix it then. On Thu, 2011-09-15 at 16:34 -0500, David Kelly wrote: > I was able to get it running on PADS with trunk. I ran into the same issue. > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log > > David > > ----- Original Message ----- > > From: "David Kelly" > > To: "Michael Wilde" > > Cc: "swift-devel Devel" , "Papia Rizwan" > > Sent: Thursday, September 15, 2011 2:39:47 PM > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using passive > > persistent coasters. Is there a way to use automatic coasters on the > > MCS workstations? I'll try copying this over to PADS and running there > > to see if I can reproduce it. > > > > David > > > > ----- Original Message ----- > > > From: "Michael Wilde" > > > To: "David Kelly" > > > Cc: "swift-devel Devel" , "Papia > > > Rizwan" , "Mihael Hategan" > > > > > > Sent: Thursday, September 15, 2011 2:18:17 PM > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > Can you make SWAT run under trunk, as Papia is testing using > > > standard > > > auto coasters, and doesnt need any of the missing coaster-service > > > options.
> > > > > > - Mike > > > > > > > > > ----- Original Message ----- > > > > From: "David Kelly" > > > > To: "Michael Wilde" > > > > Cc: "swift-devel Devel" , "Papia > > > > Rizwan" , "Mihael Hategan" > > > > > > > > Sent: Thursday, September 15, 2011 2:15:36 PM > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > I got past the compilation errors by renaming the all functions > > > > with > > > > capitalization, but ran into an issue with coaster-service. Last > > > > week > > > > I noticed coaster-service was missing options for dynamic ports. I > > > > found today that it is also missing -passive. I'll try to track > > > > down > > > > where this changed and restore the previous version. > > > > > > > > David > > > > > > > > ----- Original Message ----- > > > > > From: "Michael Wilde" > > > > > To: "David Kelly" > > > > > Cc: "swift-devel Devel" , "Papia > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > Excellent, thanks - thats good. I also just verified that Papia > > > > > is > > > > > not > > > > > using the overAllocation tags in the sites file, so this problem > > > > > is > > > > > clearly a Java deadlock and has nothing to do with the > > > > > scheduling > > > > > problem that the (now fixed) overAllocation problem was > > > > > causing.. > > > > > > > > > > My understanding is that this SWAT script is failing under trunk > > > > > because of the recent token case handling issue (I think the > > > > > camel-case one). Can you work with Papia to see if either that > > > > > issue > > > > > is now fixed, or if her script can be changed to avoid that, so > > > > > that > > > > > you can both test the SWAT script with trunk, to see if the > > > > > deadlock > > > > > still occurs? 
> > > > > > > > > > Thanks, > > > > > > > > > > - MIke > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > From: "David Kelly" > > > > > > To: "Michael Wilde" > > > > > > Cc: "swift-devel Devel" , "Papia > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > I narrowed down the problem a bit. Last night I ran jstack on > > > > > > the > > > > > > wrong java process which is why it didn't report a deadlock. > > > > > > > > > > > > Papia and I are seeing the same issue. > > > > > > > > > > > > My jstack: http://www.ci.uchicago.edu/~davidk/swat2/jstack.log > > > > > > Papia's jstack: > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log > > > > > > > > > > > > It happens in the same place: > > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100) > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24) > > > > > > > > > > > > Filed as bug #559 > > > > > > > > > > > > David > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "Michael Wilde" > > > > > > > To: "David Kelly" > > > > > > > Cc: "swift-devel Devel" , > > > > > > > "Papia > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > David, it sounds like more analysis is needed here. If the > > > > > > > SWAT > > > > > > > runs > > > > > > > are not showing a deadlock (but your runs are) then likely > > > > > > > we > > > > > > > have > > > > > > > two > > > > > > > different problems here. > > > > > > > > > > > > > > Another case we saw in 0.93 with scripts failing to progress > > > > > > > is > > > > > > > due > > > > > > > to > > > > > > > the overAllocation parameter problem that Mihael fixed > > > > > > > yesterday. 
> > > > > > > The > > > > > > > symptom there is that Swift starts a coaster with a time > > > > > > > slot > > > > > > > too > > > > > > > small for the apps in the script, and no apps wind up > > > > > > > running. > > > > > > > I > > > > > > > think > > > > > > > that situation in general merits a separate ticket, and may > > > > > > > have > > > > > > > been > > > > > > > discussed on swift-devel (but quite a while ago). > > > > > > > > > > > > > > Can you determine if indeed Papia's SWAT runs are hanging > > > > > > > for > > > > > > > a > > > > > > > reason > > > > > > > other than a Java deadlock? > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > From: "David Kelly" > > > > > > > > To: "Michael Wilde" > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > "Papia > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > The jstack log corresponds to the most recent log file - > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. > > > > > > > > jstack does not report any deadlocks, but I thought it > > > > > > > > might > > > > > > > > be > > > > > > > > useful > > > > > > > > so I included it. Swift was not making any progress for > > > > > > > > about > > > > > > > > 5 > > > > > > > > hours > > > > > > > > before I sent the logs. I am running the latest 0.93 > > > > > > > > branch. > > > > > > > > I > > > > > > > > will > > > > > > > > try again today. 
> > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > From: "Michael Wilde" > > > > > > > > > To: "David Kelly" > > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > > "Papia > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > David, which of the many Swift logs in that /swat dir > > > > > > > > > does > > > > > > > > > the > > > > > > > > > jstack.log pertain to? How many of these runs > > > > > > > > > deadlocked? > > > > > > > > > > > > > > > > > > And, did you verify that you (and Papia) are running on > > > > > > > > > the > > > > > > > > > latest > > > > > > > > > rev > > > > > > > > > of the 0.93 branch? > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > From: "David Kelly" > > > > > > > > > > To: "Mihael Hategan" > > > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > > > "Papia > > > > > > > > > > Rizwan" , "Michael Wilde" > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 PM > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > I was able to reproduce the problem with persistent > > > > > > > > > > coasters > > > > > > > > > > on > > > > > > > > > > the > > > > > > > > > > MCS servers. > > > > > > > > > > > > > > > > > > > > The jstack output is at > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log > > > > > > > > > > > > > > > > > > > > The full collection of logs are at > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat. 
> > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > From: "Mihael Hategan" > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > , > > > > > > > > > > > "Papia > > > > > > > > > > > Rizwan" > > > > > > > > > > > Sent: Wednesday, September 14, 2011 10:30:48 PM > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > Could you also forward the attachments please? > > > > > > > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael Wilde > > > > > > > > > > > wrote: > > > > > > > > > > > > I think I am seeing a similar deadlock on 0.93 in > > > > > > > > > > > > the > > > > > > > > > > > > ParVis > > > > > > > > > > > > script, > > > > > > > > > > > > and am trying to get a clean log and jstack to > > > > > > > > > > > > confirm. > > > > > > > > > > > > > > > > > > > > > > > > As far as I can tell, Papia is running the correct > > > > > > > > > > > > 0.93 > > > > > > > > > > > > code, > > > > > > > > > > > > but > > > > > > > > > > > > please verify. > > > > > > > > > > > > > > > > > > > > > > > > David will try to replicate this problem as well. > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > From: "Papia Rizwan" > > > > > > > > > > > > > To: "swift-devel Devel" > > > > > > > > > > > > > , > > > > > > > > > > > > > "Michael > > > > > > > > > > > > > Wilde" , "Michael P. Shields" > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM > > > > > > > > > > > > > Subject: swift 0.93 deadlock > > > > > > > > > > > > > Attached are the jstack output and the log file. 
> > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Papia Rizwan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > Swift-devel mailing list > > > > > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Michael Wilde > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > -- > > > > > > > Michael Wilde > > > > > > > Computation Institute, University of Chicago > > > > > > > Mathematics and Computer Science Division > > > > > > > Argonne National Laboratory > > > > > > > > > > -- > > > > > Michael Wilde > > > > > Computation Institute, University of Chicago > > > > > Mathematics and Computer Science Division > > > > > Argonne National Laboratory > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidk at ci.uchicago.edu Thu Sep 15 18:55:25 2011 From: davidk at ci.uchicago.edu (David Kelly) Date: Thu, 15 Sep 2011 18:55:25 -0500 (CDT) Subject: [Swift-devel] swift 0.93 deadlock In-Reply-To: <736589018.107162.1316122442091.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1981667591.107357.1316130925453.JavaMail.root@zimbra-mb2.anl.gov> Persistent coasters in trunk is fixed now. 
I think an older version of coaster-service somehow got checked in, so I ran a reverse merge and resolved the conflicts. I tested on mcs workstations with 1000 cats and all seems well. David ----- Original Message ----- > From: "David Kelly" > To: "Michael Wilde" > Cc: "swift-devel Devel" , "Papia Rizwan" > Sent: Thursday, September 15, 2011 4:34:02 PM > Subject: Re: [Swift-devel] swift 0.93 deadlock > I was able to get it running on PADS with trunk. I ran into the same > issue. > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log > > David From wilde at mcs.anl.gov Fri Sep 16 11:44:54 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri,
16 Sep 2011 11:44:54 -0500 (CDT) Subject: [Swift-devel] Integrity of trunk in SVN (was: Re: swift 0.93 deadlock) In-Reply-To: <1981667591.107357.1316130925453.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <379261672.13578.1316191494415.JavaMail.root@zimbra.anl.gov> Sounds good, thanks, David. Three questions on this: - It's not related to the 0.93 deadlock that is holding back the SWAT app, is it? (i.e., not related to that email thread?) - Can you trace back how the code got damaged in SVN, and see if anything else got similarly back-leveled that we may not yet have detected in trunk? - Can you take the action item of resuming nightly automated test suite execution on trunk, and periodic (or even nightly) testing on the latest release (to see if the suite catches occasional, sporadically occurring bugs)? Thanks, - Mike ----- Original Message ----- > From: "David Kelly" > To: "Michael Wilde" > Cc: "swift-devel Devel" , "Papia Rizwan" > Sent: Thursday, September 15, 2011 6:55:25 PM > Subject: Re: [Swift-devel] swift 0.93 deadlock > Persistent coasters in trunk is fixed now. I think an older version of > coaster-service somehow got checked in, so I ran a reverse merge and > resolved the conflicts. I tested on mcs workstations with 1000 cats > and all seems well. > > David > > ----- Original Message ----- > > From: "David Kelly" > > To: "Michael Wilde" > > Cc: "swift-devel Devel" , "Papia > > Rizwan" > > Sent: Thursday, September 15, 2011 4:34:02 PM > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > I was able to get it running on PADS with trunk. I ran into the same > > issue.
> > > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log > > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log > > > > David > > > > ----- Original Message ----- > > > From: "David Kelly" > > > To: "Michael Wilde" > > > Cc: "swift-devel Devel" , "Papia > > > Rizwan" > > > Sent: Thursday, September 15, 2011 2:39:47 PM > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using passive > > > persistent coasters. Is there a way to use automatic coasters on > > > the > > > MCS workstations? I'll try copying this over to PADS and running > > > there > > > to see if I can reproduce it. > > > > > > David > > > > > > ----- Original Message ----- > > > > From: "Michael Wilde" > > > > To: "David Kelly" > > > > Cc: "swift-devel Devel" , "Papia > > > > Rizwan" , "Mihael Hategan" > > > > > > > > Sent: Thursday, September 15, 2011 2:18:17 PM > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > Can you make SWAT run under trunk, as Papia is testing using > > > > standard > > > > auto coasters, and doesnt need any of the missing > > > > coaster-service > > > > options. > > > > > > > > - Mike > > > > > > > > > > > > ----- Original Message ----- > > > > > From: "David Kelly" > > > > > To: "Michael Wilde" > > > > > Cc: "swift-devel Devel" , "Papia > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > Sent: Thursday, September 15, 2011 2:15:36 PM > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > I got past the compilation errors by renaming the all > > > > > functions > > > > > with > > > > > capitalization, but ran into an issue with coaster-service. > > > > > Last > > > > > week > > > > > I noticed coaster-service was missing options for dynamic > > > > > ports. > > > > > I > > > > > found today that it is also missing -passive. I'll try to > > > > > track > > > > > down > > > > > where this changed and restore the previous version. 
> > > > > > > > > > David > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Michael Wilde" > > > > > > To: "David Kelly" > > > > > > Cc: "swift-devel Devel" , > > > > > > "Papia > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > Excellent, thanks - thats good. I also just verified that > > > > > > Papia > > > > > > is > > > > > > not > > > > > > using the overAllocation tags in the sites file, so this > > > > > > problem > > > > > > is > > > > > > clearly a Java deadlock and has nothing to do with the > > > > > > scheduling > > > > > > problem that the (now fixed) overAllocation problem was > > > > > > causing.. > > > > > > > > > > > > My understanding is that this SWAT script is failing under > > > > > > trunk > > > > > > because of the recent token case handling issue (I think the > > > > > > camel-case one). Can you work with Papia to see if either > > > > > > that > > > > > > issue > > > > > > is now fixed, or if her script can be changed to avoid that, > > > > > > so > > > > > > that > > > > > > you can both test the SWAT script with trunk, to see if the > > > > > > deadlock > > > > > > still occurs? > > > > > > > > > > > > Thanks, > > > > > > > > > > > > - MIke > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "David Kelly" > > > > > > > To: "Michael Wilde" > > > > > > > Cc: "swift-devel Devel" , > > > > > > > "Papia > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > I narrowed down the problem a bit. Last night I ran jstack > > > > > > > on > > > > > > > the > > > > > > > wrong java process which is why it didn't report a > > > > > > > deadlock. > > > > > > > > > > > > > > Papia and I are seeing the same issue. 
> > > > > > > > > > > > > > My jstack: > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/jstack.log > > > > > > > Papia's jstack: > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log > > > > > > > > > > > > > > It happens in the same place: > > > > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100) > > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24) > > > > > > > > > > > > > > Filed as bug #559 > > > > > > > > > > > > > > David > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > From: "Michael Wilde" > > > > > > > > To: "David Kelly" > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > "Papia > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > David, it sounds like more analysis is needed here. If > > > > > > > > the > > > > > > > > SWAT > > > > > > > > runs > > > > > > > > are not showing a deadlock (but your runs are) then > > > > > > > > likely > > > > > > > > we > > > > > > > > have > > > > > > > > two > > > > > > > > different problems here. > > > > > > > > > > > > > > > > Another case we saw in 0.93 with scripts failing to > > > > > > > > progress > > > > > > > > is > > > > > > > > due > > > > > > > > to > > > > > > > > the overAllocation parameter problem that Mihael fixed > > > > > > > > yesterday. > > > > > > > > The > > > > > > > > symptom there is that Swift starts a coaster with a time > > > > > > > > slot > > > > > > > > too > > > > > > > > small for the apps in the script, and no apps wind up > > > > > > > > running. > > > > > > > > I > > > > > > > > think > > > > > > > > that situation in general merits a separate ticket, and > > > > > > > > may > > > > > > > > have > > > > > > > > been > > > > > > > > discussed on swift-devel (but quite a while ago). 
> > > > > > > > > > > > > > > > Can you determine if indeed Papia's SWAT runs are > > > > > > > > hanging > > > > > > > > for > > > > > > > > a > > > > > > > > reason > > > > > > > > other than a Java deadlock? > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > From: "David Kelly" > > > > > > > > > To: "Michael Wilde" > > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > > "Papia > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > The jstack log corresponds to the most recent log file > > > > > > > > > - > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. > > > > > > > > > jstack does not report any deadlocks, but I thought it > > > > > > > > > might > > > > > > > > > be > > > > > > > > > useful > > > > > > > > > so I included it. Swift was not making any progress > > > > > > > > > for > > > > > > > > > about > > > > > > > > > 5 > > > > > > > > > hours > > > > > > > > > before I sent the logs. I am running the latest 0.93 > > > > > > > > > branch. > > > > > > > > > I > > > > > > > > > will > > > > > > > > > try again today. > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > From: "Michael Wilde" > > > > > > > > > > To: "David Kelly" > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > , > > > > > > > > > > "Papia > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > David, which of the many Swift logs in that /swat > > > > > > > > > > dir > > > > > > > > > > does > > > > > > > > > > the > > > > > > > > > > jstack.log pertain to? 
How many of these runs > > > > > > > > > > deadlocked? > > > > > > > > > > > > > > > > > > > > And, did you verify that you (and Papia) are running > > > > > > > > > > on > > > > > > > > > > the > > > > > > > > > > latest > > > > > > > > > > rev > > > > > > > > > > of the 0.93 branch? > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > To: "Mihael Hategan" > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > , > > > > > > > > > > > "Papia > > > > > > > > > > > Rizwan" , "Michael Wilde" > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 PM > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > I was able to reproduce the problem with > > > > > > > > > > > persistent > > > > > > > > > > > coasters > > > > > > > > > > > on > > > > > > > > > > > the > > > > > > > > > > > MCS servers. > > > > > > > > > > > > > > > > > > > > > > The jstack output is at > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log > > > > > > > > > > > > > > > > > > > > > > The full collection of logs are at > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat. > > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > From: "Mihael Hategan" > > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > , > > > > > > > > > > > > "Papia > > > > > > > > > > > > Rizwan" > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 10:30:48 PM > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > Could you also forward the attachments please? 
> > > > > > > > > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael Wilde > > > > > > > > > > > > wrote: > > > > > > > > > > > > > I think I am seeing a similar deadlock on 0.93 > > > > > > > > > > > > > in > > > > > > > > > > > > > the > > > > > > > > > > > > > ParVis > > > > > > > > > > > > > script, > > > > > > > > > > > > > and am trying to get a clean log and jstack to > > > > > > > > > > > > > confirm. > > > > > > > > > > > > > > > > > > > > > > > > > > As far as I can tell, Papia is running the > > > > > > > > > > > > > correct > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > code, > > > > > > > > > > > > > but > > > > > > > > > > > > > please verify. > > > > > > > > > > > > > > > > > > > > > > > > > > David will try to replicate this problem as > > > > > > > > > > > > > well. > > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > From: "Papia Rizwan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "swift-devel Devel" > > > > > > > > > > > > > > , > > > > > > > > > > > > > > "Michael > > > > > > > > > > > > > > Wilde" , "Michael P. > > > > > > > > > > > > > > Shields" > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 > > > > > > > > > > > > > > PM > > > > > > > > > > > > > > Subject: swift 0.93 deadlock > > > > > > > > > > > > > > Attached are the jstack output and the log > > > > > > > > > > > > > > file. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > Papia Rizwan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > Swift-devel mailing list > > > > > > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Michael Wilde > > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > -- > > > > > > > > Michael Wilde > > > > > > > > Computation Institute, University of Chicago > > > > > > > > Mathematics and Computer Science Division > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > -- > > > > > > Michael Wilde > > > > > > Computation Institute, University of Chicago > > > > > > Mathematics and Computer Science Division > > > > > > Argonne National Laboratory > > > > > > > > -- > > > > Michael Wilde > > > > Computation Institute, University of Chicago > > > > Mathematics and Computer Science Division > > > > Argonne National Laboratory > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Fri Sep 16 11:50:55 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 16 Sep 2011 11:50:55 -0500 (CDT) Subject: [Swift-devel] swift 0.93 deadlock 
In-Reply-To: <736589018.107162.1316122442091.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <248843162.13624.1316191855001.JavaMail.root@zimbra.anl.gov>

David and Papia, can you report to the list what the status is of running the SWAT app?

- I understand that Mihael will work on the 0.93 deadlock fix this weekend, which is great.

- I understand that it's happening on trunk as well.

- Papia, can you try to "perturb" the Swift code in the hopes that some equivalent but different code doesn't trip into the same bug? I.e., try a different mapper or a different variable strategy (e.g., arrays vs. scalars, structs vs. separate vars) just to see if you can work around this? Or, put in some shell logic to catch the hang and kill and re-run (or resume) Swift? If you just kill a hung script and then resume it, will it work? We could maybe alter the hang checker to kill Swift on its own, with a return code or message that you could use to trigger a resume.

Mike

----- Original Message -----
> From: "David Kelly"
> To: "Michael Wilde"
> Cc: "swift-devel Devel", "Papia Rizwan"
> Sent: Thursday, September 15, 2011 4:34:02 PM
> Subject: Re: [Swift-devel] swift 0.93 deadlock
> I was able to get it running on PADS with trunk. I ran into the same issue.
>
> http://www.ci.uchicago.edu/~davidk/swat3/jstack.log
> http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log
>
> David

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From dsk at ci.uchicago.edu Fri Sep 16 12:47:16 2011
From: dsk at ci.uchicago.edu (Daniel S. Katz)
Date: Fri, 16 Sep 2011 12:47:16 -0500
Subject: [Swift-devel] globus provision and swift
Message-ID: <12AB7A1A-2C45-4A53-9E79-42F399C95182@ci.uchicago.edu>

I'm listening to Borja's talk on globus provision.

An interesting project would be to see what could be done with this and Swift.
For example, could we build something similar that would let people try Swift on EC2 using this? Not sure this specific instance makes sense, but the idea probably does.

Dan

--
Daniel S. Katz
University of Chicago
(773) 834-7186 (voice)
(773) 834-6818 (fax)
d.katz at ieee.org or dsk at ci.uchicago.edu
http://www.ci.uchicago.edu/~dsk/

From wilde at mcs.anl.gov Fri Sep 16 12:52:04 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 16 Sep 2011 12:52:04 -0500 (CDT)
Subject: [Swift-devel] globus provision and swift
In-Reply-To: <12AB7A1A-2C45-4A53-9E79-42F399C95182@ci.uchicago.edu>
Message-ID: <508358790.13870.1316195524028.JavaMail.root@zimbra.anl.gov>

Dan, good ideas.

David, can you give a status update of what's working on EC2 and FutureGrid?

Dan, you can view the latest draft of the instructions for FutureGrid Cloud usage at:

http://www.ci.uchicago.edu/swift/wwwdev/guides/trunk/siteguide/siteguide.html#_futuregrid_quickstart_guide

- Mike

----- Original Message -----
> From: "Daniel S. Katz"
> To: "swift-devel Devel"
> Sent: Friday, September 16, 2011 12:47:16 PM
> Subject: [Swift-devel] globus provision and swift
> I'm listening to Borja's talk on globus provision.
>
> An interesting project would be to see what could be done with this and Swift. For example, could we build something similar that would let people try Swift on EC2 using this? Not sure this specific instance makes sense, but the idea probably does.
>
> Dan

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From dsk at ci.uchicago.edu Fri Sep 16 14:05:16 2011
From: dsk at ci.uchicago.edu (Daniel S. Katz)
Date: Fri, 16 Sep 2011 14:05:16 -0500
Subject: [Swift-devel] globus provision and swift
In-Reply-To: <508358790.13870.1316195524028.JavaMail.root@zimbra.anl.gov>
References: <508358790.13870.1316195524028.JavaMail.root@zimbra.anl.gov>
Message-ID: <863686C9-53B7-440D-BC33-CC50B07AEC66@ci.uchicago.edu>

On Sep 16, 2011, at 12:52 PM, Michael Wilde wrote:

> Dan, good ideas.
>
> David, can you give a status update of what's working on EC2 and FutureGrid?
>
> Dan, you can view the latest draft of the instructions for FutureGrid Cloud usage at:
>
> http://www.ci.uchicago.edu/swift/wwwdev/guides/trunk/siteguide/siteguide.html#_futuregrid_quickstart_guide

That looks fairly straightforward.

Dan

>
> - Mike

--
Daniel S. Katz
University of Chicago
(773) 834-7186 (voice)
(773) 834-6818 (fax)
d.katz at ieee.org or dsk at ci.uchicago.edu
http://www.ci.uchicago.edu/~dsk/

From davidk at ci.uchicago.edu Fri Sep 16 15:38:52 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Fri, 16 Sep 2011 15:38:52 -0500 (CDT)
Subject: [Swift-devel] Integrity of trunk in SVN (was: Re: swift 0.93 deadlock)
In-Reply-To: <379261672.13578.1316191494415.JavaMail.root@zimbra.anl.gov>
Message-ID: <1143912327.108703.1316205532627.JavaMail.root@zimbra-mb2.anl.gov>

> - it's not related to the 0.93 deadlock that is holding back the SWAT app, is it?
> (i.e., not related to that email thread?)

Right, it's not directly related to the deadlock. I mentioned it on that thread because we are now able to test with trunk on the MCS servers.

> - can you trace back how the code got damaged in SVN, and see if anything else got similarly back-leveled that we may not yet have detected in trunk?

I'll take a look at other commits around that time and see if there is anything else.

> - can you take the action item of resuming nightly automated test suite execution on trunk, and periodic (or even nightly) testing on the latest release (to see if the suite catches occasional sporadically occurring bugs)

Automated tests have been running nightly for a few weeks now on 0.93 and trunk.
http://www.ci.uchicago.edu/swift/wwwdev/tests/tests.pl

I will add automatic provider testing to that soon, and clean up the organization a bit.

David

> ----- Original Message -----
> > From: "David Kelly"
> > To: "Michael Wilde"
> > Cc: "swift-devel Devel", "Papia Rizwan"
> > Sent: Thursday, September 15, 2011 6:55:25 PM
> > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > Persistent coasters in trunk are fixed now. I think an older version of
> > coaster-service somehow got checked in, so I ran a reverse merge and
> > resolved the conflicts. I tested on MCS workstations with 1000 cats
> > and all seems well.
> >
> > David
> >
> > ----- Original Message -----
> > > From: "David Kelly"
> > > To: "Michael Wilde"
> > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > Sent: Thursday, September 15, 2011 4:34:02 PM
> > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > I was able to get it running on PADS with trunk. I ran into the same
> > > issue.
> > >
> > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log
> > > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log
> > >
> > > David
> > >
> > > ----- Original Message -----
> > > > From: "David Kelly"
> > > > To: "Michael Wilde"
> > > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > > Sent: Thursday, September 15, 2011 2:39:47 PM
> > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using passive
> > > > persistent coasters. Is there a way to use automatic coasters on the
> > > > MCS workstations? I'll try copying this over to PADS and running
> > > > there to see if I can reproduce it.
> > > >
> > > > David
> > > >
> > > > ----- Original Message -----
> > > > > From: "Michael Wilde"
> > > > > To: "David Kelly"
> > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > Sent: Thursday, September 15, 2011 2:18:17 PM
> > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > Can you make SWAT run under trunk, as Papia is testing using
> > > > > standard auto coasters, and doesn't need any of the missing
> > > > > coaster-service options.
> > > > >
> > > > > - Mike
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "David Kelly"
> > > > > > To: "Michael Wilde"
> > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > Sent: Thursday, September 15, 2011 2:15:36 PM
> > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > I got past the compilation errors by renaming all the functions
> > > > > > with capitalization, but ran into an issue with coaster-service.
> > > > > > Last week I noticed coaster-service was missing options for
> > > > > > dynamic ports. I found today that it is also missing -passive.
> > > > > > I'll try to track down where this changed and restore the
> > > > > > previous version.
> > > > > >
> > > > > > David
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Michael Wilde"
> > > > > > > To: "David Kelly"
> > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM
> > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > Excellent, thanks - that's good. I also just verified that Papia
> > > > > > > is not using the overAllocation tags in the sites file, so this
> > > > > > > problem is clearly a Java deadlock and has nothing to do with the
> > > > > > > scheduling problem that the (now fixed) overAllocation problem
> > > > > > > was causing.
> > > > > > >
> > > > > > > My understanding is that this SWAT script is failing under trunk
> > > > > > > because of the recent token case handling issue (I think the
> > > > > > > camel-case one). Can you work with Papia to see if either that
> > > > > > > issue is now fixed, or if her script can be changed to avoid
> > > > > > > that, so that you can both test the SWAT script with trunk, to
> > > > > > > see if the deadlock still occurs?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > - Mike
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > > From: "David Kelly"
> > > > > > > > To: "Michael Wilde"
> > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM
> > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > I narrowed down the problem a bit. Last night I ran jstack on
> > > > > > > > the wrong Java process, which is why it didn't report a deadlock.
> > > > > > > >
> > > > > > > > Papia and I are seeing the same issue.
> > > > > > > >
> > > > > > > > My jstack:
> > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/jstack.log
> > > > > > > > Papia's jstack:
> > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log
> > > > > > > >
> > > > > > > > It happens in the same place:
> > > > > > > >
> > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100)
> > > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24)
> > > > > > > >
> > > > > > > > Filed as bug #559
> > > > > > > >
> > > > > > > > David
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Michael Wilde"
> > > > > > > > > To: "David Kelly"
> > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM
> > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > David, it sounds like more analysis is needed here. If the SWAT
> > > > > > > > > runs are not showing a deadlock (but your runs are) then likely
> > > > > > > > > we have two different problems here.
> > > > > > > > >
> > > > > > > > > Another case we saw in 0.93 with scripts failing to progress is
> > > > > > > > > due to the overAllocation parameter problem that Mihael fixed
> > > > > > > > > yesterday. The symptom there is that Swift starts a coaster
> > > > > > > > > with a time slot too small for the apps in the script, and no
> > > > > > > > > apps wind up running. I think that situation in general merits
> > > > > > > > > a separate ticket, and may have been discussed on swift-devel
> > > > > > > > > (but quite a while ago).
> > > > > > > > >
> > > > > > > > > Can you determine if indeed Papia's SWAT runs are hanging for a
> > > > > > > > > reason other than a Java deadlock?
> > > > > > > > >
> > > > > > > > > - Mike
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > > From: "David Kelly"
> > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM
> > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > The jstack log corresponds to the most recent log file -
> > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log.
> > > > > > > > > > jstack does not report any deadlocks, but I thought it might
> > > > > > > > > > be useful so I included it. Swift was not making any progress
> > > > > > > > > > for about 5 hours before I sent the logs. I am running the
> > > > > > > > > > latest 0.93 branch. I will try again today.
> > > > > > > > > >
> > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > > To: "David Kelly"
> > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM
> > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > David, which of the many Swift logs in that /swat dir does
> > > > > > > > > > > the jstack.log pertain to? How many of these runs deadlocked?
> > > > > > > > > > >
> > > > > > > > > > > And, did you verify that you (and Papia) are running on the
> > > > > > > > > > > latest rev of the 0.93 branch?
> > > > > > > > > > >
> > > > > > > > > > > - Mike
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > > To: "Mihael Hategan"
> > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Michael Wilde"
> > > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 PM
> > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > I was able to reproduce the problem with persistent
> > > > > > > > > > > > coasters on the MCS servers.
> > > > > > > > > > > >
> > > > > > > > > > > > The jstack output is at
> > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log
> > > > > > > > > > > >
> > > > > > > > > > > > The full collection of logs is at
> > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat.
> > > > > > > > > > > >
> > > > > > > > > > > > David
> > > > > > > > > > > >
> > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > From: "Mihael Hategan"
> > > > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 10:30:48 PM
> > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > Could you also forward the attachments please?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Mihael
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael Wilde wrote:
> > > > > > > > > > > > > > I think I am seeing a similar deadlock on 0.93 in the
> > > > > > > > > > > > > > ParVis script, and am trying to get a clean log and
> > > > > > > > > > > > > > jstack to confirm.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > As far as I can tell, Papia is running the correct
> > > > > > > > > > > > > > 0.93 code, but please verify.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > David will try to replicate this problem as well.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - Mike
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > From: "Papia Rizwan"
> > > > > > > > > > > > > > > To: "swift-devel Devel", "Michael Wilde", "Michael P. Shields"
> > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM
> > > > > > > > > > > > > > > Subject: swift 0.93 deadlock
> > > > > > > > > > > > > > > Attached are the jstack output and the log file.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Papia Rizwan
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Michael Wilde
> > > > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > > > Argonne National Laboratory
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Michael Wilde
> > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > Argonne National Laboratory
> > > > > > >
> > > > > > > --
> > > > > > > Michael Wilde
> > > > > > > Computation Institute, University of Chicago
> > > > > > > Mathematics and Computer Science Division
> > > > > > > Argonne National Laboratory
> > > > >
> > > > > --
> > > > > Michael Wilde
> > > > > Computation Institute, University of Chicago
> > > > > Mathematics and Computer Science Division
> > > > > Argonne National Laboratory
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

From bresnaha at mcs.anl.gov Sat Sep 17 03:02:37 2011
From: bresnaha at mcs.anl.gov (John Bresnahan)
Date: Fri, 16 Sep 2011 22:02:37 -1000
Subject: [Swift-devel] globus provision and swift
In-Reply-To: <12AB7A1A-2C45-4A53-9E79-42F399C95182@ci.uchicago.edu>
References: <12AB7A1A-2C45-4A53-9E79-42F399C95182@ci.uchicago.edu>
Message-ID: <4E74541D.8060604@mcs.anl.gov>

The cloudinit.d tool is specifically for this kind of thing. It is like a general purpose globus-provision. Instead of being tailored for easy use with Globus, it has been created for a slightly more sophisticated user that wants to create easy demos (and more) of their products. We have an example of this with Cloud Foundry.

http://www.nimbusproject.org/docs/5e22b3d5f2ed4028b96b2dcdd7c02eb4CIDTP/platform/cloudinitd/

On 09/16/2011 07:47 AM, Daniel S. Katz wrote:
> I'm listening to Borja's talk on globus provision.
>
> An interesting project would be to see what could be done with this and Swift. For example, could we
> build something similar that would let people try Swift on EC2 using this? Not sure this specific
> instance makes sense, but the idea probably does.
>
> Dan
>
> --
> Daniel S.
Katz
> University of Chicago
> (773) 834-7186 (voice)
> (773) 834-6818 (fax)
> d.katz at ieee.org or dsk at ci.uchicago.edu
> http://www.ci.uchicago.edu/~dsk/
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Sat Sep 17 07:22:29 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 17 Sep 2011 07:22:29 -0500 (CDT)
Subject: [Swift-devel] Integrity of trunk in SVN (was: Re: swift 0.93 deadlock)
In-Reply-To: <1143912327.108703.1316205532627.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <2122594985.15546.1316262149623.JavaMail.root@zimbra.anl.gov>

> Automated tests have been running nightly for a few weeks now on 0.93
> and trunk.
>
> http://www.ci.uchicago.edu/swift/wwwdev/tests/tests.pl

That's excellent, David! Kudos and thanks to you and everyone else who enhanced the test suite: Justin, Alberto, Mihael, and everyone else who added tests! The HTML output looks great.

> I will add automatic provider testing to that soon, and clean up the
> organization a bit.

Cool.

Can we also add a test to reproduce the current 0.93 Java deadlock? If you can create a simpler test case that exhibits the problem, then Mihael can use it when he goes after this bug.

Papia, can you distill the failing script down to something that can more readily be done in the test suite? I.e., something that runs on localhost, doesn't use Octave, and has trivial built-in files? Then put it in a shell script loop and see if we can make it deadlock (i.e., cause the hang-checker to go off)?

Thanks,

- Mike

From dsk at ci.uchicago.edu Sat Sep 17 07:25:23 2011
From: dsk at ci.uchicago.edu (Daniel S. Katz)
Date: Sat, 17 Sep 2011 07:25:23 -0500
Subject: [Swift-devel] Integrity of trunk in SVN (was: Re: swift 0.93 deadlock)
In-Reply-To: <2122594985.15546.1316262149623.JavaMail.root@zimbra.anl.gov>
References: <2122594985.15546.1316262149623.JavaMail.root@zimbra.anl.gov>
Message-ID:

This is very good.

Where are the tests being run? It looks like the current tests are all local, is this right?

Dan

On Sep 17, 2011, at 7:22 AM, Michael Wilde wrote:

>
>> Automated tests have been running nightly for a few weeks now on 0.93
>> and trunk.
>>
>> http://www.ci.uchicago.edu/swift/wwwdev/tests/tests.pl
>
> That's excellent, David! Kudos and thanks to you and everyone else who enhanced the test suite: Justin, Alberto, Mihael, and everyone else who added tests! The HTML output looks great.
>
>> I will add automatic provider testing to that soon, and clean up the
>> organization a bit.
>
> Cool.
>
> Can we also add a test to reproduce the current 0.93 Java deadlock? If you can create a simpler test case that exhibits the problem, then Mihael can use it when he goes after this bug.
>
> Papia, can you distill the failing script down to something that can more readily be done in the test suite? I.e., something that runs on localhost, doesn't use Octave, and has trivial built-in files? Then put it in a shell script loop and see if we can make it deadlock (i.e., cause the hang-checker to go off)?
>
> Thanks,
>
> - Mike
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Daniel S.
Katz
University of Chicago
(773) 834-7186 (voice)
(773) 834-6818 (fax)
d.katz at ieee.org or dsk at ci.uchicago.edu
http://www.ci.uchicago.edu/~dsk/

From wilde at mcs.anl.gov Sat Sep 17 07:41:17 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sat, 17 Sep 2011 07:41:17 -0500 (CDT)
Subject: [Swift-devel] Integrity of trunk in SVN (was: Re: swift 0.93 deadlock)
In-Reply-To:
Message-ID: <1211009823.15560.1316263277325.JavaMail.root@zimbra.anl.gov>

Yes, I think that's right. I suspect they are being run on one of our two lab machines, bridled and communicado, likely under David's login from cron. We just got a new CI login "swift" under which we will move all the nightly Swift automated processes.

And David has all the tests for remote systems close to fully automated; he already runs them under scripts.

You can view the test suite framework and test groups in the Swift src tree:

https://svn.ci.uchicago.edu/svn/vdl2/trunk/tests/

The remote systems tested are under tests/providers:

beagle/
crow/
fusion/
intrepid/
local/
local-coasters/
mcs/
pads/
queenbee/
surveyor/

----- Original Message -----
> From: "Daniel S. Katz"
> To: "Michael Wilde"
> Cc: "David Kelly", "Papia Rizwan", "swift-devel Devel"
> Sent: Saturday, September 17, 2011 7:25:23 AM
> Subject: Re: [Swift-devel] Integrity of trunk in SVN (was: Re: swift 0.93 deadlock)
> This is very good.
>
> Where are the tests being run? It looks like the current tests are all
> local, is this right?
>
> Dan

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Sat Sep 17 23:36:25 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 17 Sep 2011 21:36:25 -0700
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <248843162.13624.1316191855001.JavaMail.root@zimbra.anl.gov>
References: <248843162.13624.1316191855001.JavaMail.root@zimbra.anl.gov>
Message-ID: <1316320585.17056.0.camel@blabla>

I have a tentative fix in the branch and trunk. Revisions 5123 and 5124, respectively. Please let me know how that works out.

Mihael

On Fri, 2011-09-16 at 11:50 -0500, Michael Wilde wrote:
> David and Papia, can you report to the list what the status is of running the SWAT app?
>
> - I understand that Mihael will work on the 0.93 deadlock fix this weekend, which is great.
> > - I understand that its happening on trunk as well > > - Papia, can you try to "perturb" the Swift code in the hopes that some equivalent but different code doesnt trip into the same bug? Ie try a different mapper, different variable strategy (ie arrays vs scalars, structs vs separate vars) just to see if you can work around this? Or, put in some shell logic to catch the hang and kill and re-run (or resume) Swift? if you just kill a hung script and then resume it, will it work? We could maybe alter the hang checker to kill swift on its own, with a return code or message that you could use to trigger a resume. > > Mike > > > ----- Original Message ----- > > From: "David Kelly" > > To: "Michael Wilde" > > Cc: "swift-devel Devel" , "Papia Rizwan" > > Sent: Thursday, September 15, 2011 4:34:02 PM > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > I was able to get it running on PADS with trunk. I ran into the same > > issue. > > > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log > > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log > > > > David > > > > ----- Original Message ----- > > > From: "David Kelly" > > > To: "Michael Wilde" > > > Cc: "swift-devel Devel" , "Papia > > > Rizwan" > > > Sent: Thursday, September 15, 2011 2:39:47 PM > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using passive > > > persistent coasters. Is there a way to use automatic coasters on the > > > MCS workstations? I'll try copying this over to PADS and running > > > there > > > to see if I can reproduce it. 
> > > > > > David > > > > > > ----- Original Message ----- > > > > From: "Michael Wilde" > > > > To: "David Kelly" > > > > Cc: "swift-devel Devel" , "Papia > > > > Rizwan" , "Mihael Hategan" > > > > > > > > Sent: Thursday, September 15, 2011 2:18:17 PM > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > Can you make SWAT run under trunk, as Papia is testing using > > > > standard > > > > auto coasters, and doesnt need any of the missing coaster-service > > > > options. > > > > > > > > - Mike > > > > > > > > > > > > ----- Original Message ----- > > > > > From: "David Kelly" > > > > > To: "Michael Wilde" > > > > > Cc: "swift-devel Devel" , "Papia > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > Sent: Thursday, September 15, 2011 2:15:36 PM > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > I got past the compilation errors by renaming the all functions > > > > > with > > > > > capitalization, but ran into an issue with coaster-service. Last > > > > > week > > > > > I noticed coaster-service was missing options for dynamic ports. > > > > > I > > > > > found today that it is also missing -passive. I'll try to track > > > > > down > > > > > where this changed and restore the previous version. > > > > > > > > > > David > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Michael Wilde" > > > > > > To: "David Kelly" > > > > > > Cc: "swift-devel Devel" , "Papia > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > Excellent, thanks - thats good. I also just verified that > > > > > > Papia > > > > > > is > > > > > > not > > > > > > using the overAllocation tags in the sites file, so this > > > > > > problem > > > > > > is > > > > > > clearly a Java deadlock and has nothing to do with the > > > > > > scheduling > > > > > > problem that the (now fixed) overAllocation problem was > > > > > > causing.. 
> > > > > > > > > > > > My understanding is that this SWAT script is failing under > > > > > > trunk > > > > > > because of the recent token case handling issue (I think the > > > > > > camel-case one). Can you work with Papia to see if either that > > > > > > issue > > > > > > is now fixed, or if her script can be changed to avoid that, > > > > > > so > > > > > > that > > > > > > you can both test the SWAT script with trunk, to see if the > > > > > > deadlock > > > > > > still occurs? > > > > > > > > > > > > Thanks, > > > > > > > > > > > > - MIke > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "David Kelly" > > > > > > > To: "Michael Wilde" > > > > > > > Cc: "swift-devel Devel" , > > > > > > > "Papia > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > I narrowed down the problem a bit. Last night I ran jstack > > > > > > > on > > > > > > > the > > > > > > > wrong java process which is why it didn't report a deadlock. > > > > > > > > > > > > > > Papia and I are seeing the same issue. 
> > > > > > > > > > > > > > My jstack: > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/jstack.log > > > > > > > Papia's jstack: > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log > > > > > > > > > > > > > > It happens in the same place: > > > > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100) > > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24) > > > > > > > > > > > > > > Filed as bug #559 > > > > > > > > > > > > > > David > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > From: "Michael Wilde" > > > > > > > > To: "David Kelly" > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > "Papia > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > David, it sounds like more analysis is needed here. If the > > > > > > > > SWAT > > > > > > > > runs > > > > > > > > are not showing a deadlock (but your runs are) then likely > > > > > > > > we > > > > > > > > have > > > > > > > > two > > > > > > > > different problems here. > > > > > > > > > > > > > > > > Another case we saw in 0.93 with scripts failing to > > > > > > > > progress > > > > > > > > is > > > > > > > > due > > > > > > > > to > > > > > > > > the overAllocation parameter problem that Mihael fixed > > > > > > > > yesterday. > > > > > > > > The > > > > > > > > symptom there is that Swift starts a coaster with a time > > > > > > > > slot > > > > > > > > too > > > > > > > > small for the apps in the script, and no apps wind up > > > > > > > > running. > > > > > > > > I > > > > > > > > think > > > > > > > > that situation in general merits a separate ticket, and > > > > > > > > may > > > > > > > > have > > > > > > > > been > > > > > > > > discussed on swift-devel (but quite a while ago). 
> > > > > > > > > > > > > > > > Can you determine if indeed Papia's SWAT runs are hanging > > > > > > > > for > > > > > > > > a > > > > > > > > reason > > > > > > > > other than a Java deadlock? > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > From: "David Kelly" > > > > > > > > > To: "Michael Wilde" > > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > > "Papia > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > The jstack log corresponds to the most recent log file - > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. > > > > > > > > > jstack does not report any deadlocks, but I thought it > > > > > > > > > might > > > > > > > > > be > > > > > > > > > useful > > > > > > > > > so I included it. Swift was not making any progress for > > > > > > > > > about > > > > > > > > > 5 > > > > > > > > > hours > > > > > > > > > before I sent the logs. I am running the latest 0.93 > > > > > > > > > branch. > > > > > > > > > I > > > > > > > > > will > > > > > > > > > try again today. > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > From: "Michael Wilde" > > > > > > > > > > To: "David Kelly" > > > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > > > "Papia > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > David, which of the many Swift logs in that /swat dir > > > > > > > > > > does > > > > > > > > > > the > > > > > > > > > > jstack.log pertain to? How many of these runs > > > > > > > > > > deadlocked? 
> > > > > > > > > > > > > > > > > > > > And, did you verify that you (and Papia) are running > > > > > > > > > > on > > > > > > > > > > the > > > > > > > > > > latest > > > > > > > > > > rev > > > > > > > > > > of the 0.93 branch? > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > To: "Mihael Hategan" > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > , > > > > > > > > > > > "Papia > > > > > > > > > > > Rizwan" , "Michael Wilde" > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 PM > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > I was able to reproduce the problem with persistent > > > > > > > > > > > coasters > > > > > > > > > > > on > > > > > > > > > > > the > > > > > > > > > > > MCS servers. > > > > > > > > > > > > > > > > > > > > > > The jstack output is at > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log > > > > > > > > > > > > > > > > > > > > > > The full collection of logs are at > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat. > > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > From: "Mihael Hategan" > > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > , > > > > > > > > > > > > "Papia > > > > > > > > > > > > Rizwan" > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 10:30:48 PM > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > Could you also forward the attachments please? 
> > > > > > > > > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael Wilde > > > > > > > > > > > > wrote: > > > > > > > > > > > > > I think I am seeing a similar deadlock on 0.93 > > > > > > > > > > > > > in > > > > > > > > > > > > > the > > > > > > > > > > > > > ParVis > > > > > > > > > > > > > script, > > > > > > > > > > > > > and am trying to get a clean log and jstack to > > > > > > > > > > > > > confirm. > > > > > > > > > > > > > > > > > > > > > > > > > > As far as I can tell, Papia is running the > > > > > > > > > > > > > correct > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > code, > > > > > > > > > > > > > but > > > > > > > > > > > > > please verify. > > > > > > > > > > > > > > > > > > > > > > > > > > David will try to replicate this problem as > > > > > > > > > > > > > well. > > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > From: "Papia Rizwan" > > > > > > > > > > > > > > To: "swift-devel Devel" > > > > > > > > > > > > > > , > > > > > > > > > > > > > > "Michael > > > > > > > > > > > > > > Wilde" , "Michael P. > > > > > > > > > > > > > > Shields" > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM > > > > > > > > > > > > > > Subject: swift 0.93 deadlock > > > > > > > > > > > > > > Attached are the jstack output and the log > > > > > > > > > > > > > > file. 
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Papia Rizwan
> > > > > > > > > > > > > >
> > > > > > > > > > > > _______________________________________________
> > > > > > > > > > > > Swift-devel mailing list
> > > > > > > > > > > > Swift-devel at ci.uchicago.edu
> > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Michael Wilde
> > > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > > Argonne National Laboratory
> > > > > > > >
> > > > > > > > --
> > > > > > > > Michael Wilde
> > > > > > > > Computation Institute, University of Chicago
> > > > > > > > Mathematics and Computer Science Division
> > > > > > > > Argonne National Laboratory
> > > > > >
> > > > > > --
> > > > > > Michael Wilde
> > > > > > Computation Institute, University of Chicago
> > > > > > Mathematics and Computer Science Division
> > > > > > Argonne National Laboratory
> > > >
> > > > --
> > > > Michael Wilde
> > > > Computation Institute, University of Chicago
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>

From hategan at mcs.anl.gov Mon Sep 19 11:06:17 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 19 Sep 2011 09:06:17 -0700
Subject: [Swift-devel] case sensitivity
Message-ID: <1316448377.8167.4.camel@blabla>

Things have now become truly case sensitive in trunk. Which means that
there is a distinction between toint and toInt. I opted for toInt for
now. In any event, your scripts may break if they use the wrong
capitalization.

We should document this, and, before releasing 0.94, decide on the
details of the capitalization of various things.

Mihael

From wilde at mcs.anl.gov Mon Sep 19 11:44:12 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 19 Sep 2011 11:44:12 -0500 (CDT)
Subject: [Swift-devel] case sensitivity
In-Reply-To: <1316448377.8167.4.camel@blabla>
Message-ID: <1695577466.18235.1316450652253.JavaMail.root@zimbra.anl.gov>

I feel that backwards compatibility - at least for some length of
time, but better yet indefinitely - needs to take priority over
aesthetic issues.

So I can see providing and documenting a function like toInt() while
indefinitely allowing toint(). I agree that we should be consistent in
using camelCase.

However, until Yadu's email below I was never aware of any case
insensitivity in Swift - I always programmed as if all names were case
sensitive. I didn't pay close attention to the brief list discussion on
case below. Was there any follow-up discussion?

Can you summarize what the issues are?

I believe that Swift should be fully case sensitive for all names, and
that we should not cause existing user code to break unless the cost
of such backwards compatibility is much worse than the breakage.

- Mike

----- Forwarded Message -----
From: "Yadu Nand"
To: "Ben Clifford"
Cc: "swift-devel"
Sent: Tuesday, August 9, 2011 10:01:22 AM
Subject: Re: [Swift-devel] Overwriting procedures in swift.

> what happens with case? (and, what *should* happen with case?)

(int o) f (int i) {
    o = i;
}
(int z) F (int a) {
    z = a * 5;
}
trace(f(5), F(5));

For the above snippet, trace returns 25, 25.
So F is overwriting f anyway. I don't think this is right.

> I think karajan identifiers are case insensitive (?) but this patch looks
> like it is case-sensitive.

Fixed it. Please check the new patch attached.

----- Original Message -----
> From: "Mihael Hategan"
> To: "Swift Devel"
> Sent: Monday, September 19, 2011 11:06:17 AM
> Subject: [Swift-devel] case sensitivity
> Things have now become truly case sensitive in trunk. Which means that
> there is a distinction between toint and toInt. I opted for toInt for
> now. In any event, your scripts may break if they use the wrong
> capitalization.
>
> We should document this, and, before releasing 0.94, decide on the
> details of the capitalization of various things.
>
> Mihael
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wozniak at mcs.anl.gov Mon Sep 19 14:00:24 2011
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 19 Sep 2011 14:00:24 -0500 (Central Daylight Time)
Subject: [Swift-devel] case sensitivity
In-Reply-To: <1316448377.8167.4.camel@blabla>
References: <1316448377.8167.4.camel@blabla>
Message-ID:

Can you update test 0053-toint.swift?

Thanks

On Mon, 19 Sep 2011, Mihael Hategan wrote:

> Things have now become truly case sensitive in trunk. Which means that
> there is a distinction between toint and toInt. I opted for toInt for
> now. In any event, your scripts may break if they use the wrong
> capitalization.
>
> We should document this, and, before releasing 0.94, decide on the
> details of the capitalization of various things.
>
> Mihael
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>

--
Justin M Wozniak

From wilde at mcs.anl.gov Mon Sep 19 14:02:47 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 19 Sep 2011 14:02:47 -0500 (CDT)
Subject: [Swift-devel] case sensitivity
In-Reply-To: <1695577466.18235.1316450652253.JavaMail.root@zimbra.anl.gov>
Message-ID: <1423072562.18954.1316458967049.JavaMail.root@zimbra.anl.gov>

I think toInt is broken in trunk. I just svn up'ed my trunk and can't
seem to get any variant of toint() to work. They all give:

fusion$ cat ti.swift
int i = @toInt("123");
fusion$ swift ti.swift
no sites file specified, setting to default: /homes/wilde/swift/src/trunk/cog/modules/swift/dist/swift-svn/etc/sites.xml
Could not start execution: Compile error in assignment at line 2: Unknown function: @toInt: Unknown function: @toInt
fusion$

I tried @toint(), @toInt() and toInt().

- Mike

----- Original Message -----
> From: "Michael Wilde"
> To: "Mihael Hategan"
> Cc: "Swift Devel"
> Sent: Monday, September 19, 2011 11:44:12 AM
> Subject: Re: [Swift-devel] case sensitivity
> I feel that backwards compatibility - at least for some length of
> time, but better yet indefinitely - needs to take priority over
> aesthetic issues.
>
> So I can see providing and documenting a function like toInt() while
> indefinitely allowing toint(). I agree that we should be consistent in
> using camelCase.
>
> However, until Yadu's email below I was never aware of any case
> insensitivity in Swift - I always programmed as if all names were case
> sensitive. I didn't pay close attention to the brief list discussion on
> case below. Was there any follow-up discussion?
>
> Can you summarize what the issues are?
>
> I believe that Swift should be fully case sensitive for all names, and
> that we should not cause existing user code to break unless the cost
> of such backwards compatibility is much worse than the breakage.
>
> - Mike
>
> ----- Forwarded Message -----
> From: "Yadu Nand"
> To: "Ben Clifford"
> Cc: "swift-devel"
> Sent: Tuesday, August 9, 2011 10:01:22 AM
> Subject: Re: [Swift-devel] Overwriting procedures in swift.
>
> > what happens with case? (and, what *should* happen with case?)
>
> (int o) f (int i) {
>     o = i;
> }
> (int z) F (int a) {
>     z = a * 5;
> }
> trace(f(5), F(5));
>
> For the above snippet, trace returns 25, 25.
> So F is overwriting f anyway. I don't think this is right.
>
> > I think karajan identifiers are case insensitive (?) but this patch
> > looks like it is case-sensitive.
>
> Fixed it. Please check the new patch attached.
>
>
> ----- Original Message -----
> > From: "Mihael Hategan"
> > To: "Swift Devel"
> > Sent: Monday, September 19, 2011 11:06:17 AM
> > Subject: [Swift-devel] case sensitivity
> > Things have now become truly case sensitive in trunk. Which means
> > that there is a distinction between toint and toInt. I opted for
> > toInt for now. In any event, your scripts may break if they use the
> > wrong capitalization.
> >
> > We should document this, and, before releasing 0.94, decide on the
> > details of the capitalization of various things.
> >
> > Mihael
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Mon Sep 19 14:17:45 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 19 Sep 2011 14:17:45 -0500 (CDT)
Subject: [Swift-devel] Trunk gives concurrentModification exception
In-Reply-To: <449333108.19016.1316459752326.JavaMail.root@zimbra.anl.gov>
Message-ID: <2012606678.19020.1316459865826.JavaMail.root@zimbra.anl.gov>

I'm seeing this error on trunk (as of earlier this morning). It seems
to be occurring transiently. One swift invocation will get the
exception; the next one will work, etc.

This run was on fusion; the log is on CI net under
~wilde/pagoda2-20110919-1412-ek0v688b.log

- Mike

fusion$ more swift.out
Swift svn swift-r5131 (swift modified locally) cog-r3286

RunID: 20110919-1412-ek0v688b
Progress: time: Mon, 19 Sep 2011 14:12:05 -0500
Ex098
java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
        at java.util.HashMap$KeyIterator.next(HashMap.java:828)
        at java.util.AbstractCollection.toArray(AbstractCollection.java:124)
        at java.util.ArrayList.addAll(ArrayList.java:472)
        at org.griphyn.vdl.karajan.ArrayIndexFutureList.<init>(ArrayIndexFutureList.java:33)
        at org.griphyn.vdl.mapping.ArrayDataNode.getFutureWrapper(ArrayDataNode.java:95)
        at org.griphyn.vdl.mapping.ArrayDataNode.getFutureList(ArrayDataNode.java:102)
        at org.griphyn.vdl.karajan.lib.GetArrayIterator.function(GetArrayIterator.java:50)
        at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62)
        at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214)
        at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
        at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28)
        at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29)
        at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20)
        at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139)
        at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197)
        at org.globus.cog.karajan.workflow.FlowElementWrapper.start(FlowElementWrapper.java:227)
        at org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:98)
        at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Execution failed:
        null

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From davidk at ci.uchicago.edu Mon Sep 19 22:20:16 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Mon, 19 Sep 2011 22:20:16 -0500 (CDT)
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <1316320585.17056.0.camel@blabla>
Message-ID: <959559750.111934.1316488816675.JavaMail.root@zimbra-mb2.anl.gov>

I tried today with the 0.93 update. It ran for approximately 7 hours
before freezing. It looks to be happening in a different place this
time.

http://www.ci.uchicago.edu/~davidk/swat4/jstack.log
http://www.ci.uchicago.edu/~davidk/swat4/cce_ua-20110919-1955-h7t8iui2.log

David

----- Original Message -----
> From: "Mihael Hategan"
> To: "Michael Wilde"
> Cc: "David Kelly", "swift-devel Devel", "Papia Rizwan"
> Sent: Saturday, September 17, 2011 11:36:25 PM
> Subject: Re: [Swift-devel] swift 0.93 deadlock
> I have a tentative fix in the branch and trunk. Revisions 5123 and
> 5124, respectively. Please let me know how that works out.
>
> Mihael
>
> On Fri, 2011-09-16 at 11:50 -0500, Michael Wilde wrote:
> > David and Papia, can you report to the list what the status is of
> > running the SWAT app?
> >
> > - I understand that Mihael will work on the 0.93 deadlock fix this
> > weekend, which is great.
> >
> > - I understand that it's happening on trunk as well
> >
> > - Papia, can you try to "perturb" the Swift code in the hopes that
> > some equivalent but different code doesn't trip into the same bug?
Ie > > try a different mapper, different variable strategy (ie arrays vs > > scalars, structs vs separate vars) just to see if you can work > > around this? Or, put in some shell logic to catch the hang and kill > > and re-run (or resume) Swift? if you just kill a hung script and > > then resume it, will it work? We could maybe alter the hang checker > > to kill swift on its own, with a return code or message that you > > could use to trigger a resume. > > > > Mike > > > > > > ----- Original Message ----- > > > From: "David Kelly" > > > To: "Michael Wilde" > > > Cc: "swift-devel Devel" , "Papia > > > Rizwan" > > > Sent: Thursday, September 15, 2011 4:34:02 PM > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > I was able to get it running on PADS with trunk. I ran into the > > > same > > > issue. > > > > > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log > > > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log > > > > > > David > > > > > > ----- Original Message ----- > > > > From: "David Kelly" > > > > To: "Michael Wilde" > > > > Cc: "swift-devel Devel" , "Papia > > > > Rizwan" > > > > Sent: Thursday, September 15, 2011 2:39:47 PM > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using > > > > passive > > > > persistent coasters. Is there a way to use automatic coasters on > > > > the > > > > MCS workstations? I'll try copying this over to PADS and running > > > > there > > > > to see if I can reproduce it. 
> > > > > > > > David > > > > > > > > ----- Original Message ----- > > > > > From: "Michael Wilde" > > > > > To: "David Kelly" > > > > > Cc: "swift-devel Devel" , "Papia > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > Sent: Thursday, September 15, 2011 2:18:17 PM > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > Can you make SWAT run under trunk, as Papia is testing using > > > > > standard > > > > > auto coasters, and doesnt need any of the missing > > > > > coaster-service > > > > > options. > > > > > > > > > > - Mike > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > From: "David Kelly" > > > > > > To: "Michael Wilde" > > > > > > Cc: "swift-devel Devel" , > > > > > > "Papia > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > Sent: Thursday, September 15, 2011 2:15:36 PM > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > I got past the compilation errors by renaming the all > > > > > > functions > > > > > > with > > > > > > capitalization, but ran into an issue with coaster-service. > > > > > > Last > > > > > > week > > > > > > I noticed coaster-service was missing options for dynamic > > > > > > ports. > > > > > > I > > > > > > found today that it is also missing -passive. I'll try to > > > > > > track > > > > > > down > > > > > > where this changed and restore the previous version. > > > > > > > > > > > > David > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "Michael Wilde" > > > > > > > To: "David Kelly" > > > > > > > Cc: "swift-devel Devel" , > > > > > > > "Papia > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > Excellent, thanks - thats good. 
I also just verified that > > > > > > > Papia > > > > > > > is > > > > > > > not > > > > > > > using the overAllocation tags in the sites file, so this > > > > > > > problem > > > > > > > is > > > > > > > clearly a Java deadlock and has nothing to do with the > > > > > > > scheduling > > > > > > > problem that the (now fixed) overAllocation problem was > > > > > > > causing.. > > > > > > > > > > > > > > My understanding is that this SWAT script is failing under > > > > > > > trunk > > > > > > > because of the recent token case handling issue (I think > > > > > > > the > > > > > > > camel-case one). Can you work with Papia to see if either > > > > > > > that > > > > > > > issue > > > > > > > is now fixed, or if her script can be changed to avoid > > > > > > > that, > > > > > > > so > > > > > > > that > > > > > > > you can both test the SWAT script with trunk, to see if > > > > > > > the > > > > > > > deadlock > > > > > > > still occurs? > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > - MIke > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > From: "David Kelly" > > > > > > > > To: "Michael Wilde" > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > "Papia > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > I narrowed down the problem a bit. Last night I ran > > > > > > > > jstack > > > > > > > > on > > > > > > > > the > > > > > > > > wrong java process which is why it didn't report a > > > > > > > > deadlock. > > > > > > > > > > > > > > > > Papia and I are seeing the same issue. 
> > > > > > > > > > > > > > > > My jstack: > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/jstack.log > > > > > > > > Papia's jstack: > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log > > > > > > > > > > > > > > > > It happens in the same place: > > > > > > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100) > > > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24) > > > > > > > > > > > > > > > > Filed as bug #559 > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > From: "Michael Wilde" > > > > > > > > > To: "David Kelly" > > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > > "Papia > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > David, it sounds like more analysis is needed here. If > > > > > > > > > the > > > > > > > > > SWAT > > > > > > > > > runs > > > > > > > > > are not showing a deadlock (but your runs are) then > > > > > > > > > likely > > > > > > > > > we > > > > > > > > > have > > > > > > > > > two > > > > > > > > > different problems here. > > > > > > > > > > > > > > > > > > Another case we saw in 0.93 with scripts failing to > > > > > > > > > progress > > > > > > > > > is > > > > > > > > > due > > > > > > > > > to > > > > > > > > > the overAllocation parameter problem that Mihael fixed > > > > > > > > > yesterday. > > > > > > > > > The > > > > > > > > > symptom there is that Swift starts a coaster with a > > > > > > > > > time > > > > > > > > > slot > > > > > > > > > too > > > > > > > > > small for the apps in the script, and no apps wind up > > > > > > > > > running. 
> > > > > > > > > I > > > > > > > > > think > > > > > > > > > that situation in general merits a separate ticket, > > > > > > > > > and > > > > > > > > > may > > > > > > > > > have > > > > > > > > > been > > > > > > > > > discussed on swift-devel (but quite a while ago). > > > > > > > > > > > > > > > > > > Can you determine if indeed Papia's SWAT runs are > > > > > > > > > hanging > > > > > > > > > for > > > > > > > > > a > > > > > > > > > reason > > > > > > > > > other than a Java deadlock? > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > From: "David Kelly" > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > , > > > > > > > > > > "Papia > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > The jstack log corresponds to the most recent log > > > > > > > > > > file - > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. > > > > > > > > > > jstack does not report any deadlocks, but I thought > > > > > > > > > > it > > > > > > > > > > might > > > > > > > > > > be > > > > > > > > > > useful > > > > > > > > > > so I included it. Swift was not making any progress > > > > > > > > > > for > > > > > > > > > > about > > > > > > > > > > 5 > > > > > > > > > > hours > > > > > > > > > > before I sent the logs. I am running the latest 0.93 > > > > > > > > > > branch. > > > > > > > > > > I > > > > > > > > > > will > > > > > > > > > > try again today. 
> > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > From: "Michael Wilde" > > > > > > > > > > > To: "David Kelly" > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > , > > > > > > > > > > > "Papia > > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > David, which of the many Swift logs in that /swat > > > > > > > > > > > dir > > > > > > > > > > > does > > > > > > > > > > > the > > > > > > > > > > > jstack.log pertain to? How many of these runs > > > > > > > > > > > deadlocked? > > > > > > > > > > > > > > > > > > > > > > And, did you verify that you (and Papia) are > > > > > > > > > > > running > > > > > > > > > > > on > > > > > > > > > > > the > > > > > > > > > > > latest > > > > > > > > > > > rev > > > > > > > > > > > of the 0.93 branch? > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > > To: "Mihael Hategan" > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > , > > > > > > > > > > > > "Papia > > > > > > > > > > > > Rizwan" , "Michael > > > > > > > > > > > > Wilde" > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 PM > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > I was able to reproduce the problem with > > > > > > > > > > > > persistent > > > > > > > > > > > > coasters > > > > > > > > > > > > on > > > > > > > > > > > > the > > > > > > > > > > > > MCS servers. 
> > > > > > > > > > > > > > > > > > > > > > > > The jstack output is at > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log > > > > > > > > > > > > > > > > > > > > > > > > The full collection of logs are at > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat. > > > > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > From: "Mihael Hategan" > > > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > , > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > Rizwan" > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 10:30:48 > > > > > > > > > > > > > PM > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > > Could you also forward the attachments please? > > > > > > > > > > > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael > > > > > > > > > > > > > Wilde > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > I think I am seeing a similar deadlock on > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > in > > > > > > > > > > > > > > the > > > > > > > > > > > > > > ParVis > > > > > > > > > > > > > > script, > > > > > > > > > > > > > > and am trying to get a clean log and jstack > > > > > > > > > > > > > > to > > > > > > > > > > > > > > confirm. > > > > > > > > > > > > > > > > > > > > > > > > > > > > As far as I can tell, Papia is running the > > > > > > > > > > > > > > correct > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > code, > > > > > > > > > > > > > > but > > > > > > > > > > > > > > please verify. > > > > > > > > > > > > > > > > > > > > > > > > > > > > David will try to replicate this problem as > > > > > > > > > > > > > > well. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > > From: "Papia Rizwan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "swift-devel Devel" > > > > > > > > > > > > > > > , > > > > > > > > > > > > > > > "Michael > > > > > > > > > > > > > > > Wilde" , "Michael P. > > > > > > > > > > > > > > > Shields" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 > > > > > > > > > > > > > > > 1:56:13 PM > > > > > > > > > > > > > > > Subject: swift 0.93 deadlock > > > > > > > > > > > > > > > Attached are the jstack output and the log > > > > > > > > > > > > > > > file. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > Papia Rizwan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > Swift-devel mailing list > > > > > > > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > Michael Wilde > > > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Michael Wilde > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > -- > > > > > > > Michael Wilde > > > > > > > Computation Institute, University of Chicago > > > > > > > Mathematics and Computer Science Division > > > > > > > Argonne National Laboratory > > > > > > > > > > -- > > > > > Michael Wilde > > > > > Computation 
Institute, University of Chicago
> > > > > Mathematics and Computer Science Division
> > > > > Argonne National Laboratory
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >

From wilde at mcs.anl.gov Mon Sep 19 23:01:22 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 19 Sep 2011 23:01:22 -0500 (CDT)
Subject: [Swift-devel] NPE when cpus register to persistent coaster service
In-Reply-To: <161734697.20386.1316491214474.JavaMail.root@zimbra.anl.gov>
Message-ID: <799928578.20388.1316491282825.JavaMail.root@zimbra.anl.gov>

I'm seeing this error in the service log (in trunk) when workers
register CPUs with the service:

(Swift svn swift-r5131 (swift modified locally) cog-r3286)

2011-09-19 22:52:57,882-0500 INFO AbstractStreamKarajanChannel nullChannel started
2011-09-19 22:52:57,882-0500 INFO AbstractStreamKarajanChannel$Multiplexer (0) Scheduling SC-null for addition
2011-09-19 22:52:57,882-0500 INFO AbstractStreamKarajanChannel nullChannel started
2011-09-19 22:52:57,907-0500 INFO LocalTCPService Received registration: blockid = swork3, url = f1
2011-09-19 22:52:57,917-0500 INFO AbstractStreamKarajanChannel$Multiplexer (0) Scheduling SC-null for addition
2011-09-19 22:52:57,917-0500 INFO AbstractStreamKarajanChannel nullChannel started
2011-09-19 22:52:57,949-0500 INFO MetaChannel MetaChannel: 381531395[1729747990: {}] -> null.bind -> SC-null
2011-09-19 22:52:57,952-0500 DEBUG Cpu worker started: block=swork3 host=f1 id=0
2011-09-19 22:52:57,952-0500 DEBUG Cpu ready for work: block=swork3 id=0
2011-09-19 22:52:57,953-0500 INFO Block Started CPU 0:1316490777s
2011-09-19 22:52:57,953-0500 TRACE Cpu swork3:0 pull
2011-09-19 22:52:57,953-0500 INFO Block Started worker swork3:000000
2011-09-19 22:52:57,953-0500 DEBUG Cpu requesting work: block=swork3 id=0 Cpus sleeping: 1
2011-09-19 22:52:57,953-0500 DEBUG Cpu swork3:0 sleeping
2011-09-19 22:52:57,954-0500 DEBUG PullThread sleep: 0:1316490777s
2011-09-19 22:52:57,955-0500 WARN BlockQueueProcessor Failed to send worker status update to client
java.lang.NullPointerException
        at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433)
        at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226)
        at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.java:72)
        at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143)
        at org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64)
        at org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57)
        at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84)
        at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:375)
2011-09-19 22:52:57,959-0500 INFO LocalTCPService Received registration: blockid = swork6, url = f1
2011-09-19 22:52:57,959-0500 INFO MetaChannel MetaChannel: 558519794[723566380: {}] -> null.bind -> SC-null
2011-09-19 22:52:57,959-0500 DEBUG Cpu worker started: block=swork6 host=f1 id=0
2011-09-19 22:52:57,959-0500 DEBUG Cpu ready for work: block=swork6 id=0
2011-09-19 22:52:57,959-0500 INFO Block Started CPU 0:1316490777s
2011-09-19 22:52:57,959-0500 TRACE Cpu swork6:0 pull
2011-09-19 22:52:57,960-0500 INFO Block Started worker swork6:000000
2011-09-19 22:52:57,960-0500 DEBUG Cpu requesting work: block=swork6 id=0 Cpus sleeping: 2
2011-09-19 22:52:57,960-0500 WARN BlockQueueProcessor Failed to send worker status update to client
java.lang.NullPointerException
        at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433)
        at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226)
        at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.java:72)
        at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143)
        at org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64)
        at org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57)
        at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84)
        at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157)
        at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:375)

The service still seems to work.

- Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Mon Sep 19 23:36:52 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 19 Sep 2011 21:36:52 -0700 Subject: [Swift-devel] case sensitivity In-Reply-To: <1695577466.18235.1316450652253.JavaMail.root@zimbra.anl.gov> References: <1695577466.18235.1316450652253.JavaMail.root@zimbra.anl.gov> Message-ID: <1316493412.9498.4.camel@blabla> On Mon, 2011-09-19 at 11:44 -0500, Michael Wilde wrote: > I feel that backwards compatibility - at least for some length period > of time, but better yet indefinitely - needs to take priority over > aesthetic issues. I suppose we could have both toInt and toint, but it seems a bit shady in the long run. Perhaps the best choice would be to do that for now and generate a warning message when the undesired form is encountered. > > So I can see providing and documenting a function like toInt() while > indefinitely allowing toint(). I agree that we should be consistent is > using camelCase. > > However, until Yadu's email below I was never aware of any case > insensitivity in Swift - I always programmed as if all names were case > sensitive. I didnt pay close attention to the brief list discussion on > case below. Was there any follow-up discussion? > > Can you summarize what the issues are? Swift shouldn't allow you to define f twice. We agreed that it can only be a mistake to have a function defined twice (since there is no way to access the first - lexically - definition). So Yadu provided a patch to detect that. But then the question was what would happen if both f and F were defined. From swift's case sensitive perspective it would be ok. But in the case insensitive implementation F would override f. So that had to be fixed. 
> > I believe that Swift should be fully case sensitive for all names, and > that we should not cause existing user code to break unless the cost > of such backwards compatibility is much worse than the breakage. I agree. From hategan at mcs.anl.gov Mon Sep 19 23:39:23 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 19 Sep 2011 21:39:23 -0700 Subject: [Swift-devel] case sensitivity In-Reply-To: <1423072562.18954.1316458967049.JavaMail.root@zimbra.anl.gov> References: <1423072562.18954.1316458967049.JavaMail.root@zimbra.anl.gov> Message-ID: <1316493563.9498.5.camel@blabla> Ooops. Yes. Seems like there's more to this than what I did. I should have a fix in tomorrow. Mihael On Mon, 2011-09-19 at 14:02 -0500, Michael Wilde wrote: > I think toInt is broken in trunk. > > I just svn up'ed my trunk and cant seem to get any variant of toint() to work. They all give: > > fusion$ cat ti.swift > int i = @toInt("123"); > fusion$ swift ti.swift > no sites file specified, setting to default: /homes/wilde/swift/src/trunk/cog/modules/swift/dist/swift-svn/etc/sites.xml > Could not start execution: > Compile error in assignment at line 2: Unknown function: @toInt: > Unknown function: @toInt > fusion$ > > I tried @toint(), @toInt() and toInt(). > > - Mike > > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "Mihael Hategan" > > Cc: "Swift Devel" > > Sent: Monday, September 19, 2011 11:44:12 AM > > Subject: Re: [Swift-devel] case sensitivity > > I feel that backwards compatibility - at least for some length period > > of time, but better yet indefinitely - needs to take priority over > > aesthetic issues. > > > > So I can see providing and documenting a function like toInt() while > > indefinitely allowing toint(). I agree that we should be consistent is > > using camelCase. > > > > However, until Yadu's email below I was never aware of any case > > insensitivity in Swift - I always programmed as if all names were case > > sensitive. 
I didnt pay close attention to the brief list discussion on > > case below. Was there any follow-up discussion? > > > > Can you summarize what the issues are? > > > > I believe that Swift should be fully case sensitive for all names, and > > that we should not cause existing user code to break unless the cost > > of such backwards compatibility is much worse than the breakage. > > > > - Mike > > > > ----- Forwarded Message ----- > > From: "Yadu Nand" > > To: "Ben Clifford" > > Cc: "swift-devel" > > Sent: Tuesday, August 9, 2011 10:01:22 AM > > Subject: Re: [Swift-devel] Overwriting procedures in swift. > > > > > what happens with case? (and, what *should* happen with case?) > > > > (int o) f ( int i) { > > o = i; > > } > > (int z) F (int a){ > > z = a * 5 ; > > } > > trace ( f (5) , F(5) ); > > > > for the above snippet, trace returns 25, 25. > > So F is overwriting f anyway. I don't think this is right. > > > > > I think karajan identifiers are case insensitive (?) but this patch > > > looks > > > like it is case-sensitive. > > > > Fixed it. Please check the new patch attached. > > > > > > ----- Original Message ----- > > > From: "Mihael Hategan" > > > To: "Swift Devel" > > > Sent: Monday, September 19, 2011 11:06:17 AM > > > Subject: [Swift-devel] case sensitivity > > > Things have now become truly case sensitive in trunk. Which means > > > that > > > there is a distinction between toint and toInt. I opted for toInt > > > for > > > now. In any event, your scripts may break if they use the wrong > > > capitalization. > > > > > > We should document this, and, before releasing 0.94, decide on the > > > details of the capitalization of various things. 
> > > > > > Mihael > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Tue Sep 20 12:02:20 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 20 Sep 2011 12:02:20 -0500 (CDT) Subject: [Swift-devel] ParVis AMWG script fails to compile on trunk: unexpected character in markup Message-ID: <2021384179.22298.1316538140935.JavaMail.root@zimbra.anl.gov> Trunk is giving the following error when trying to run the ParVis AMWG script (Im using -typecheck here but the same failure occurs without -typecheck): fusion$ swift -typecheck amwg_stats.swift no sites file specified, setting to default: /homes/wilde/swift/src/trunk/cog/modules/swift/dist/swift-svn/etc/sites.xml Could not start execution: Error reading source: : unexpected character in markup - (position: START_TAG seen ...\n <-... @2063:13) : : unexpected character in markup - (position: START_TAG seen ...\n <-... @2063:13) : : unexpected character in markup - (position: START_TAG seen ...\n <-... @2063:13) : : unexpected character in markup - (position: START_TAG seen ...\n <-... @2063:13) : : unexpected character in markup - (position: START_TAG seen ...\n <-... @2063:13) : unexpected character in markup - (position: START_TAG seen ...\n <-... 
@2063:13) 0.93 seems to compile it OK, but note what looks like a bug: -typecheck doesnt stop swift from trying to execute the script: fusion$ ~/swift/src/0.93/cog/modules/swift/dist/swift-svn/bin/swift -typecheck amwg_stats.swift no sites file specified, setting to default: /homes/wilde/swift/src/trunk/cog/modules/swift/dist/swift-svn/etc/sites.xml Swift svn swift-r5105 cog-r3262 RunID: 20110920-1157-art6r730 Progress: time: Tue, 20 Sep 2011 11:57:56 -0500 Execution failed: Missing command line argument: test_case SwiftScript trace: cntl_case: dummy fusion$ -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Tue Sep 20 12:26:56 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 20 Sep 2011 12:26:56 -0500 (CDT) Subject: [Swift-devel] ParVis AMWG script fails to compile on trunk: unexpected character in markup In-Reply-To: <2021384179.22298.1316538140935.JavaMail.root@zimbra.anl.gov> Message-ID: <910809475.22447.1316539616146.JavaMail.root@zimbra.anl.gov> The log and .swift script for this error are on the CI net under ~wilde: amwg_stats-20110920-1142-u7lz8tog.log and amwg_stats.swift - Mike ----- Original Message ----- > From: "Michael Wilde" > To: "Mihael Hategan" > Cc: "Swift Devel" > Sent: Tuesday, September 20, 2011 12:02:20 PM > Subject: [Swift-devel] ParVis AMWG script fails to compile on trunk: unexpected character in markup > Trunk is giving the following error when trying to run the ParVis AMWG > script (Im using -typecheck here but the same failure occurs without > -typecheck): > > fusion$ swift -typecheck amwg_stats.swift > no sites file specified, setting to default: > /homes/wilde/swift/src/trunk/cog/modules/swift/dist/swift-svn/etc/sites.xml > Could not start execution: > Error reading source: : unexpected character in markup - (position: > START_TAG seen ...\n <-... 
@2063:13) : : unexpected > character in markup - (position: START_TAG seen ...\n > <-... @2063:13) : > : unexpected character in markup - (position: START_TAG seen > ...\n <-... @2063:13) : : unexpected character in markup - > (position: START_TAG seen ...\n <-... @2063:13) : > : unexpected character in markup - (position: START_TAG seen > ...\n <-... @2063:13) : > unexpected character in markup - (position: START_TAG seen > ...\n <-... @2063:13) > > > 0.93 seems to compile it OK, but note what looks like a bug: > -typecheck doesnt stop swift from trying to execute the script: > > > fusion$ ~/swift/src/0.93/cog/modules/swift/dist/swift-svn/bin/swift > -typecheck amwg_stats.swift > no sites file specified, setting to default: > /homes/wilde/swift/src/trunk/cog/modules/swift/dist/swift-svn/etc/sites.xml > Swift svn swift-r5105 cog-r3262 > > RunID: 20110920-1157-art6r730 > Progress: time: Tue, 20 Sep 2011 11:57:56 -0500 > Execution failed: > Missing command line argument: test_case > SwiftScript trace: cntl_case: dummy > fusion$ > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From tim.g.armstrong at gmail.com Tue Sep 20 14:00:37 2011 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Tue, 20 Sep 2011 14:00:37 -0500 Subject: [Swift-devel] Swift 0.93 exception on Fusion In-Reply-To: <824503992.69090.1313610821222.JavaMail.root@zimbra-mb2.anl.gov> References: <1313610458.24629.0.camel@blabla> <824503992.69090.1313610821222.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: This looks like the same bug I encountered in 0.92.1: 
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=569 This affects SwiftR so I'm trying to work out what my options are to resolve it. Is it possible to backport the fix to the 0.92 release? I can do this myself, or simply patch my own version of the source if you let me know the svn revision. I'm also not sure what the overall best approach for bundling Swift with SwiftR is. My current approach is to simply include the most recent stable Swift release, but I seem to be missing out on relevant bugfixes from trunk. I'm looking in the near future to start distributed SwiftR more widely as a downloadable R package so pushing out bugfixes may become more of a pain. Any thoughts on which version of Swift I should bundle with SwiftR in order to maximise stability? - Tim On Wed, Aug 17, 2011 at 2:53 PM, David Kelly wrote: > Thanks Mihael, that fixed it. > > David > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "David Kelly" > > Cc: swift-devel at ci.uchicago.edu > > Sent: Wednesday, August 17, 2011 2:47:38 PM > > Subject: Re: [Swift-devel] Swift 0.93 exception on Fusion > > Fixed in svn. > > > > On Wed, 2011-08-17 at 11:34 -0500, David Kelly wrote: > > > Hello, > > > > > > When testing 0.93 on Fusion, Swift throws an exception. I am running > > > with the catsn script. 
It runs, creates the output, but then gives > > > this error when cleaning up: > > > > > > Final status: time: Wed, 17 Aug 2011 11:17:38 -0500 Finished > > > successfully:10 > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > Cannot submit job > > > at > > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:67) > > > at > > > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:45) > > > at > > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:57) > > > at > > > > org.globus.cog.abstraction.coaster.service.job.manager.LocalQueueProcessor.run(LocalQueueProcessor.java:40) > > > Caused by: java.lang.NullPointerException > > > at > > > > org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.makeName(PBSExecutor.java:304) > > > at > > > > org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor.writeScript(PBSExecutor.java:205) > > > at > > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.buildCommandLine(AbstractExecutor.java:169) > > > at > > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:89) > > > at > > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > > > ... 3 more > > > > > > The config and log files are attached. They can also be found on > > > fusion in ~davidk/temp. This is filed in bugzilla as bug #515. 
> > >
> > > David
> > > _______________________________________________ Swift-devel mailing
> > > list Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>

From jonmon at mcs.anl.gov Tue Sep 20 16:25:44 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Tue, 20 Sep 2011 16:25:44 -0500
Subject: [Swift-devel] swift config error
Message-ID: <641219BD-7597-4E4C-8F9D-330635662975@mcs.anl.gov>

I have been getting this error today.
Execution failed:
Swift config property "wrapperlog.always.transfer" not found in Swift configuration [swift.properties]

Here is my config file:
execution.retries=0
//replication.enabled=true
//replication.limit=3
sitedir.keep=true
status.mode=provider
wrapper.log.always.transfer=true
foreach.maxthreads=1024
wrapper.parameter.mode=files
use.provider.staging=false
provider.staging.pin.swiftfiles=false

The log is located at www.ci.uchicago.edu/~jonmon/logs/montage-20110920-1618-xu597sg2.log

Were some changes made to the properties?

From hategan at mcs.anl.gov Tue Sep 20 16:45:16 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 20 Sep 2011 14:45:16 -0700
Subject: [Swift-devel] swift config error
In-Reply-To: <641219BD-7597-4E4C-8F9D-330635662975@mcs.anl.gov>
References: <641219BD-7597-4E4C-8F9D-330635662975@mcs.anl.gov>
Message-ID: <1316555116.13853.1.camel@blabla>

On Tue, 2011-09-20 at 16:25 -0500, Jonathan Monette wrote:
> I have been getting this error today.
> Execution failed:
> Swift config property
> "wrapperlog.always.transfer" not found in Swift configuration [swift.properties]
> ...
> wrapper.log.always.transfer=true
> Were some changes made to the properties?

No. But you have an extra dot.
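
For anyone who runs into the same kind of near-miss property name, here is a minimal sketch in Python of a check that could be run over a properties file before starting a run. It is not part of Swift. The KNOWN_KEYS list is an assumption made for illustration: only "wrapperlog.always.transfer" is confirmed by the error message in this thread, and the other entries are simply copied from the config file posted above. Skipping lines that start with "//" is likewise an assumption about how that file is meant to be read.

```python
import difflib

# Hypothetical list of accepted keys. Only "wrapperlog.always.transfer" is
# confirmed by the error message in this thread; the rest are taken from the
# posted config file and are assumed to be valid.
KNOWN_KEYS = [
    "execution.retries",
    "sitedir.keep",
    "status.mode",
    "use.provider.staging",
    "wrapperlog.always.transfer",
]


def parse_keys(text):
    """Return the property keys found in `text`, in file order."""
    keys = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment-looking lines ("//" is an assumption).
        if not line or line.startswith(("#", "!", "//")):
            continue
        key, sep, _value = line.partition("=")
        if sep:
            keys.append(key.strip())
    return keys


def suggest(keys, known=KNOWN_KEYS):
    """Map each unrecognised key to its closest known key, if any."""
    suggestions = {}
    for key in keys:
        if key not in known:
            suggestions[key] = difflib.get_close_matches(
                key, known, n=1, cutoff=0.8
            )
    return suggestions


if __name__ == "__main__":
    config = "execution.retries=0\nwrapper.log.always.transfer=true\n"
    for bad_key, close in suggest(parse_keys(config)).items():
        hint = close[0] if close else "no close match"
        print(f"unknown property {bad_key!r}; did you mean {hint!r}?")
```

With the two-line sample in the `__main__` block, the only key flagged is wrapper.log.always.transfer, and the suggested replacement is wrapperlog.always.transfer, i.e. the same name without the extra dot.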
From hategan at mcs.anl.gov Tue Sep 20 16:49:23 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 20 Sep 2011 14:49:23 -0700 Subject: [Swift-devel] Swift 0.93 exception on Fusion In-Reply-To: References: <1313610458.24629.0.camel@blabla> <824503992.69090.1313610821222.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1316555363.13853.4.camel@blabla> On Tue, 2011-09-20 at 14:00 -0500, Tim Armstrong wrote: > This looks like the same bug I encountered in 0.92.1: > https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=569 > > This affects SwiftR so I'm trying to work out what my options are to > resolve it. Is it possible to backport the fix to the 0.92 release? > I can do this myself, or simply patch my own version of the source if > you let me know the svn revision. I don't think this was fixed. But I also don't think that it makes much difference. Are your jobs otherwise failing? > > I'm also not sure what the overall best approach for bundling Swift > with SwiftR is. My current approach is to simply include the most > recent stable Swift release, but I seem to be missing out on relevant > bugfixes from trunk. You are also missing the lots and lots of bugs from trunk. We generally quickly backport fixes from trunk to the stable branch if they are found in both, but less quickly the other way around. > I'm looking in the near future to start distributed SwiftR more > widely as a downloadable R package so pushing out bugfixes may become > more of a pain. Any thoughts on which version of Swift I should > bundle with SwiftR in order to maximise stability? The stable one. 
From tim.g.armstrong at gmail.com Tue Sep 20 17:17:09 2011 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Tue, 20 Sep 2011 17:17:09 -0500 Subject: [Swift-devel] Swift 0.93 exception on Fusion In-Reply-To: <1316555363.13853.4.camel@blabla> References: <1313610458.24629.0.camel@blabla> <824503992.69090.1313610821222.JavaMail.root@zimbra-mb2.anl.gov> <1316555363.13853.4.camel@blabla> Message-ID: Looking again, I think it may actually be a distinct issue that was triggered by some login node performance problems on PADS. The mention of the same pbs submit file in .globus/scripts: PBS8380254050153377753.submit misled me. On Tue, Sep 20, 2011 at 4:49 PM, Mihael Hategan wrote: > On Tue, 2011-09-20 at 14:00 -0500, Tim Armstrong wrote: > > This looks like the same bug I encountered in 0.92.1: > > https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=569 > > > > This affects SwiftR so I'm trying to work out what my options are to > > resolve it. Is it possible to backport the fix to the 0.92 release? > > I can do this myself, or simply patch my own version of the source if > > you let me know the svn revision. > > I don't think this was fixed. But I also don't think that it makes much > difference. Are your jobs otherwise failing? > > > > I'm also not sure what the overall best approach for bundling Swift > > with SwiftR is. My current approach is to simply include the most > > recent stable Swift release, but I seem to be missing out on relevant > > bugfixes from trunk. > > You are also missing the lots and lots of bugs from trunk. We generally > quickly backport fixes from trunk to the stable branch if they are found > in both, but less quickly the other way around. > > > I'm looking in the near future to start distributed SwiftR more > > widely as a downloadable R package so pushing out bugfixes may become > > more of a pain. Any thoughts on which version of Swift I should > > bundle with SwiftR in order to maximise stability? > > The stable one. 
>
>
>

From hategan at mcs.anl.gov Tue Sep 20 17:29:48 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 20 Sep 2011 15:29:48 -0700
Subject: [Swift-devel] case sensitivity
In-Reply-To: <1316493563.9498.5.camel@blabla>
References: <1423072562.18954.1316458967049.JavaMail.root@zimbra.anl.gov> <1316493563.9498.5.camel@blabla>
Message-ID: <1316557788.12443.0.camel@blabla>

It's in. r5140. Hopefully working better this time.

On Mon, 2011-09-19 at 21:39 -0700, Mihael Hategan wrote:
> Ooops. Yes. Seems like there's more to this than what I did. I should
> have a fix in tomorrow.
>
> Mihael
>
> On Mon, 2011-09-19 at 14:02 -0500, Michael Wilde wrote:
> > I think toInt is broken in trunk.
> >
> > I just svn up'ed my trunk and cant seem to get any variant of toint() to work. They all give:
> >
> > fusion$ cat ti.swift
> > int i = @toInt("123");
> > fusion$ swift ti.swift
> > no sites file specified, setting to default: /homes/wilde/swift/src/trunk/cog/modules/swift/dist/swift-svn/etc/sites.xml
> > Could not start execution:
> > Compile error in assignment at line 2: Unknown function: @toInt:
> > Unknown function: @toInt
> > fusion$
> >
> > I tried @toint(), @toInt() and toInt().
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> > > From: "Michael Wilde"
> > > To: "Mihael Hategan"
> > > Cc: "Swift Devel"
> > > Sent: Monday, September 19, 2011 11:44:12 AM
> > > Subject: Re: [Swift-devel] case sensitivity
> > > I feel that backwards compatibility - at least for some length period
> > > of time, but better yet indefinitely - needs to take priority over
> > > aesthetic issues.
> > >
> > > So I can see providing and documenting a function like toInt() while
> > > indefinitely allowing toint(). I agree that we should be consistent is
> > > using camelCase.
> > > > > > However, until Yadu's email below I was never aware of any case > > > insensitivity in Swift - I always programmed as if all names were case > > > sensitive. I didnt pay close attention to the brief list discussion on > > > case below. Was there any follow-up discussion? > > > > > > Can you summarize what the issues are? > > > > > > I believe that Swift should be fully case sensitive for all names, and > > > that we should not cause existing user code to break unless the cost > > > of such backwards compatibility is much worse than the breakage. > > > > > > - Mike > > > > > > ----- Forwarded Message ----- > > > From: "Yadu Nand" > > > To: "Ben Clifford" > > > Cc: "swift-devel" > > > Sent: Tuesday, August 9, 2011 10:01:22 AM > > > Subject: Re: [Swift-devel] Overwriting procedures in swift. > > > > > > > what happens with case? (and, what *should* happen with case?) > > > > > > (int o) f ( int i) { > > > o = i; > > > } > > > (int z) F (int a){ > > > z = a * 5 ; > > > } > > > trace ( f (5) , F(5) ); > > > > > > for the above snippet, trace returns 25, 25. > > > So F is overwriting f anyway. I don't think this is right. > > > > > > > I think karajan identifiers are case insensitive (?) but this patch > > > > looks > > > > like it is case-sensitive. > > > > > > Fixed it. Please check the new patch attached. > > > > > > > > > ----- Original Message ----- > > > > From: "Mihael Hategan" > > > > To: "Swift Devel" > > > > Sent: Monday, September 19, 2011 11:06:17 AM > > > > Subject: [Swift-devel] case sensitivity > > > > Things have now become truly case sensitive in trunk. Which means > > > > that > > > > there is a distinction between toint and toInt. I opted for toInt > > > > for > > > > now. In any event, your scripts may break if they use the wrong > > > > capitalization. > > > > > > > > We should document this, and, before releasing 0.94, decide on the > > > > details of the capitalization of various things. 
> > > > > > > > Mihael > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Tue Sep 20 17:34:50 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 20 Sep 2011 17:34:50 -0500 (CDT) Subject: [Swift-devel] New 0.93 deadlock - mapping.RootArrayDataNode? In-Reply-To: <555774935.23911.1316557963965.JavaMail.root@zimbra.anl.gov> Message-ID: <29497613.23933.1316558090414.JavaMail.root@zimbra.anl.gov> 0.93 just deadlocked for me on Fusion, starting the ParVis script. 
log and jstack are on ci net in ~wilde/ - copies of these files: -rw-r--r-- 1 wilde mcsz 69992 Sep 20 17:24 amwg_stats-20110920-1718-bj4kcf16.jstack -rw-r--r-- 1 wilde mcsz 575149 Sep 20 17:18 amwg_stats-20110920-1718-bj4kcf16.log The script didnt get very far - never started a job: RunID: 20110920-1718-bj4kcf16 Progress: time: Tue, 20 Sep 2011 17:18:38 -0500 SwiftScript trace: test_inst: -1 SwiftScript trace: test_djf: NEXT SwiftScript trace: test_case: HRC06 SwiftScript trace: test_nyrs: 10 SwiftScript trace: workdir: /home/wilde/amwg/run01/output/diag/HRC06/ SwiftScript trace: diag_code: /home/wilde/amwg/run01/code/ SwiftScript trace: paleo: False SwiftScript trace: test_path: /fusion/group/climate/Parvis/atmos/HRC06/ SwiftScript trace: test_begin: 110 SwiftScript trace: cntl_djf: NEXT SwiftScript trace: cntl_out: /home/wilde/amwg/run01/output/diag/HRC06//dummy SwiftScript trace: cntl_out_climo: /home/wilde/amwg/run01/output/climo/HRC06//dummy SwiftScript trace: cntl_case: dummy SwiftScript trace: cntl_path: /fusion/group/climate/Parvis/atmos/HRC06/ SwiftScript trace: plots: DJF SwiftScript trace: plots: ANN SwiftScript trace: plots: JJA Progress: time: Tue, 20 Sep 2011 17:19:08 -0500 Initializing:150 Selecting site:12 Stage in:8 Progress: time: Tue, 20 Sep 2011 17:19:38 -0500 Initializing:150 Selecting site:12 Stage in:8 Progress: time: Tue, 20 Sep 2011 17:20:08 -0500 Initializing:150 Selecting site:12 Stage in:8 Jstack says> Found one Java-level deadlock: ============================= "pool-1-thread-16": waiting to lock monitor 0x00000000408c3608 (object 0x00002aaab838d778, a org.griphyn.vdl.mapping.RootArrayDataNode), which is held by "pool-1-thread-7" "pool-1-thread-7": waiting to lock monitor 0x00002aaac88a06d0 (object 0x00002aaab7d93af8, a org.griphyn.vdl.mapping.RootArrayDataNode), which is held by "pool-1-thread-15" "pool-1-thread-15": waiting to lock monitor 0x00000000408c3608 (object 0x00002aaab838d778, a org.griphyn.vdl.mapping.RootArrayDataNode), 
which is held by "pool-1-thread-7" Java stack information for the threads listed above: =================================================== "pool-1-thread-16": at org.griphyn.vdl.mapping.RootArrayDataNode.getMapper(RootArrayDataNode.java:99) - waiting to lock <0x00002aaab838d778> (a org.griphyn.vdl.mapping.RootArrayDataNode) at org.griphyn.vdl.mapping.AbstractDataNode.getMapper(AbstractDataNode.java:571) at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:270) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:187) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:175) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:171) at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:17) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) at org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at 
java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) "pool-1-thread-7": at org.griphyn.vdl.mapping.ArrayDataNode.getFutureWrapper(ArrayDataNode.java:88) - waiting to lock <0x00002aaab7d93af8> (a org.griphyn.vdl.mapping.RootArrayDataNode) at org.griphyn.vdl.mapping.RootArrayDataNode.getMapper(RootArrayDataNode.java:103) - locked <0x00002aaab838d778> (a org.griphyn.vdl.mapping.RootArrayDataNode) at org.griphyn.vdl.mapping.AbstractDataNode.getMapper(AbstractDataNode.java:571) at org.griphyn.vdl.karajan.lib.VDLFunction.leafFileName(VDLFunction.java:270) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:187) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:175) at org.griphyn.vdl.karajan.lib.VDLFunction.filename(VDLFunction.java:171) at org.griphyn.vdl.karajan.lib.swiftscript.FileName.function(FileName.java:17) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62) at org.globus.cog.karajan.workflow.nodes.AbstractSequentialWithArguments.completed(AbstractSequentialWithArguments.java:194) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:214) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.functions.AbstractFunction.post(AbstractFunction.java:28) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:29) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:20) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:63) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:139) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:197) at 
org.globus.cog.karajan.workflow.events.EventBus.start(EventBus.java:104) at org.globus.cog.karajan.workflow.events.EventTargetPair.run(EventTargetPair.java:40) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) "pool-1-thread-15": at org.griphyn.vdl.mapping.RootArrayDataNode.innerInit(RootArrayDataNode.java:40) - waiting to lock <0x00002aaab838d778> (a org.griphyn.vdl.mapping.RootArrayDataNode) at org.griphyn.vdl.mapping.RootArrayDataNode.futureModified(RootArrayDataNode.java:80) at org.griphyn.vdl.karajan.ArrayIndexFutureList.notifyListeners(ArrayIndexFutureList.java:120) at org.griphyn.vdl.karajan.ArrayIndexFutureList.addKey(ArrayIndexFutureList.java:57) at org.griphyn.vdl.mapping.ArrayDataNode.addKey(ArrayDataNode.java:73) - locked <0x00002aaab7d93af8> (a org.griphyn.vdl.mapping.RootArrayDataNode) at org.griphyn.vdl.mapping.ArrayDataNode.createDSHandle(ArrayDataNode.java:82) at org.griphyn.vdl.mapping.AbstractDataNode.getField(AbstractDataNode.java:270) - locked <0x00002aaab7d93af8> (a org.griphyn.vdl.mapping.RootArrayDataNode) at org.griphyn.vdl.mapping.AbstractDataNode.getField(AbstractDataNode.java:195) at org.griphyn.vdl.karajan.lib.SetFieldValue.deepCopy(SetFieldValue.java:127) at org.griphyn.vdl.karajan.lib.SetFieldValue.function(SetFieldValue.java:46) at org.griphyn.vdl.karajan.lib.VDLFunction.post(VDLFunction.java:62) --More--(95%) -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Tue Sep 20 21:07:52 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 20 Sep 2011 19:07:52 -0700 
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <959559750.111934.1316488816675.JavaMail.root@zimbra-mb2.anl.gov>
References: <959559750.111934.1316488816675.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1316570872.13408.0.camel@blabla>

Fixed in r5141.

On Mon, 2011-09-19 at 22:20 -0500, David Kelly wrote:
> I tried today with the 0.93 update. It ran for approximately 7 hours
> before freezing. It looks to be happening in a different place this time.
>
> http://www.ci.uchicago.edu/~davidk/swat4/jstack.log
> http://www.ci.uchicago.edu/~davidk/swat4/cce_ua-20110919-1955-h7t8iui2.log
>
> David
>
>
> ----- Original Message -----
> > From: "Mihael Hategan"
> > To: "Michael Wilde"
> > Cc: "David Kelly", "swift-devel Devel", "Papia Rizwan"
> > Sent: Saturday, September 17, 2011 11:36:25 PM
> > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > I have a tentative fix in the branch and trunk. Revisions 5123 and
> > 5124, respectively. Please let me know how that works out.
> >
> > Mihael
> >
> > On Fri, 2011-09-16 at 11:50 -0500, Michael Wilde wrote:
> > > David and Papia, can you report to the list what the status is of
> > > running the SWAT app?
> > >
> > > - I understand that Mihael will work on the 0.93 deadlock fix this
> > > weekend, which is great.
> > >
> > > - I understand that it's happening on trunk as well.
> > >
> > > - Papia, can you try to "perturb" the Swift code in the hopes that
> > > some equivalent but different code doesn't trip into the same bug?
> > > I.e., try a different mapper, different variable strategy (i.e.,
> > > arrays vs scalars, structs vs separate vars) just to see if you can
> > > work around this? Or, put in some shell logic to catch the hang and
> > > kill and re-run (or resume) Swift? If you just kill a hung script
> > > and then resume it, will it work? We could maybe alter the hang
> > > checker to kill swift on its own, with a return code or message
> > > that you could use to trigger a resume.
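The "catch the hang and kill and re-run" idea above comes down to watching the Progress lines and noticing when the counters stop changing. A minimal sketch of that check in Python, assuming the "Progress: time: ... Submitted:15 Active:8 Finished successfully:27" format seen in the Swift output quoted on this list. The names progress_counters and is_stalled and the window size are made up for illustration and are not part of Swift; what to do once a stall is detected (kill, resume) is left to the caller.

```python
import re

# Matches "Name:number" pairs such as "Submitted:15" or
# "Finished successfully:27". The timestamp ("12:54:01") is skipped
# because a counter name must start with an uppercase letter.
COUNTER = re.compile(r"([A-Z][A-Za-z ]*?):(\d+)")


def progress_counters(line):
    """Return the state counters of one Progress line as a dict."""
    return {name.strip(): int(value) for name, value in COUNTER.findall(line)}


def is_stalled(lines, window=10):
    """True if the last `window` Progress lines all show the same counters.

    `window` is an illustrative threshold: with one Progress line every
    30 seconds, window=10 means about five minutes without any change.
    """
    progress = [progress_counters(l) for l in lines if l.startswith("Progress:")]
    if len(progress) < window:
        return False  # not enough history to decide
    recent = progress[-window:]
    return all(counters == recent[0] for counters in recent)
```

A wrapper script could feed this the tail of the Swift console output and, when is_stalled returns True, take a jstack dump before killing the run so the hang can still be diagnosed.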
> > > > > > Mike > > > > > > > > > ----- Original Message ----- > > > > From: "David Kelly" > > > > To: "Michael Wilde" > > > > Cc: "swift-devel Devel" , "Papia > > > > Rizwan" > > > > Sent: Thursday, September 15, 2011 4:34:02 PM > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > I was able to get it running on PADS with trunk. I ran into the > > > > same > > > > issue. > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log > > > > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log > > > > > > > > David > > > > > > > > ----- Original Message ----- > > > > > From: "David Kelly" > > > > > To: "Michael Wilde" > > > > > Cc: "swift-devel Devel" , "Papia > > > > > Rizwan" > > > > > Sent: Thursday, September 15, 2011 2:39:47 PM > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using > > > > > passive > > > > > persistent coasters. Is there a way to use automatic coasters on > > > > > the > > > > > MCS workstations? I'll try copying this over to PADS and running > > > > > there > > > > > to see if I can reproduce it. > > > > > > > > > > David > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Michael Wilde" > > > > > > To: "David Kelly" > > > > > > Cc: "swift-devel Devel" , "Papia > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > Sent: Thursday, September 15, 2011 2:18:17 PM > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > Can you make SWAT run under trunk, as Papia is testing using > > > > > > standard > > > > > > auto coasters, and doesnt need any of the missing > > > > > > coaster-service > > > > > > options. 
> > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "David Kelly" > > > > > > > To: "Michael Wilde" > > > > > > > Cc: "swift-devel Devel" , > > > > > > > "Papia > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 2:15:36 PM > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > I got past the compilation errors by renaming the all > > > > > > > functions > > > > > > > with > > > > > > > capitalization, but ran into an issue with coaster-service. > > > > > > > Last > > > > > > > week > > > > > > > I noticed coaster-service was missing options for dynamic > > > > > > > ports. > > > > > > > I > > > > > > > found today that it is also missing -passive. I'll try to > > > > > > > track > > > > > > > down > > > > > > > where this changed and restore the previous version. > > > > > > > > > > > > > > David > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > From: "Michael Wilde" > > > > > > > > To: "David Kelly" > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > "Papia > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > Excellent, thanks - thats good. I also just verified that > > > > > > > > Papia > > > > > > > > is > > > > > > > > not > > > > > > > > using the overAllocation tags in the sites file, so this > > > > > > > > problem > > > > > > > > is > > > > > > > > clearly a Java deadlock and has nothing to do with the > > > > > > > > scheduling > > > > > > > > problem that the (now fixed) overAllocation problem was > > > > > > > > causing.. 
> > > > > > > > > > > > > > > > My understanding is that this SWAT script is failing under > > > > > > > > trunk > > > > > > > > because of the recent token case handling issue (I think > > > > > > > > the > > > > > > > > camel-case one). Can you work with Papia to see if either > > > > > > > > that > > > > > > > > issue > > > > > > > > is now fixed, or if her script can be changed to avoid > > > > > > > > that, > > > > > > > > so > > > > > > > > that > > > > > > > > you can both test the SWAT script with trunk, to see if > > > > > > > > the > > > > > > > > deadlock > > > > > > > > still occurs? > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > - MIke > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > From: "David Kelly" > > > > > > > > > To: "Michael Wilde" > > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > > "Papia > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > I narrowed down the problem a bit. Last night I ran > > > > > > > > > jstack > > > > > > > > > on > > > > > > > > > the > > > > > > > > > wrong java process which is why it didn't report a > > > > > > > > > deadlock. > > > > > > > > > > > > > > > > > > Papia and I are seeing the same issue. 
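Reading a jstack dump by hand to find which threads block each other is error-prone, so the check can be sketched in Python. This is a minimal illustration, assuming the dump uses the layout of the stack traces posted in this thread: a quoted thread name on its own line, followed by frames containing "- waiting to lock <addr>" and "- locked <addr>". The function names wait_for_graph and find_cycle are made up for illustration; jstack itself performs its own deadlock detection, and this sketch only mirrors the idea.

```python
import re

THREAD = re.compile(r'^"([^"]+)"')
WAITING = re.compile(r"- waiting to lock <(0x[0-9a-f]+)>")
LOCKED = re.compile(r"- locked <(0x[0-9a-f]+)>")


def wait_for_graph(dump):
    """Map each blocked thread to the thread holding the lock it wants."""
    owner = {}   # lock address -> thread that holds it
    wants = {}   # thread -> lock address it is blocked on
    current = None
    for raw in dump.splitlines():
        line = raw.strip()
        m = THREAD.match(line)
        if m:
            current = m.group(1)
            continue
        if current is None:
            continue  # text before the first thread header
        m = WAITING.search(line)
        if m:
            wants[current] = m.group(1)
            continue
        m = LOCKED.search(line)
        if m:
            owner[m.group(1)] = current
    return {t: owner[lock] for t, lock in wants.items() if lock in owner}


def find_cycle(graph):
    """Return one cycle of threads waiting on each other, or [] if none."""
    for start in graph:
        path = []
        node = start
        while node in graph and node not in path:
            path.append(node)
            node = graph[node]
        if node in path:
            return path[path.index(node):]
    return []
```

An empty result from find_cycle does not prove the run is healthy: a dump taken from the wrong Java process, or a hang that is not a monitor deadlock, will also produce no cycle.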
> > > > > > > > > > > > > > > > > > My jstack: > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/jstack.log > > > > > > > > > Papia's jstack: > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log > > > > > > > > > > > > > > > > > > It happens in the same place: > > > > > > > > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100) > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24) > > > > > > > > > > > > > > > > > > Filed as bug #559 > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > From: "Michael Wilde" > > > > > > > > > > To: "David Kelly" > > > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > > > "Papia > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > David, it sounds like more analysis is needed here. If > > > > > > > > > > the > > > > > > > > > > SWAT > > > > > > > > > > runs > > > > > > > > > > are not showing a deadlock (but your runs are) then > > > > > > > > > > likely > > > > > > > > > > we > > > > > > > > > > have > > > > > > > > > > two > > > > > > > > > > different problems here. > > > > > > > > > > > > > > > > > > > > Another case we saw in 0.93 with scripts failing to > > > > > > > > > > progress > > > > > > > > > > is > > > > > > > > > > due > > > > > > > > > > to > > > > > > > > > > the overAllocation parameter problem that Mihael fixed > > > > > > > > > > yesterday. > > > > > > > > > > The > > > > > > > > > > symptom there is that Swift starts a coaster with a > > > > > > > > > > time > > > > > > > > > > slot > > > > > > > > > > too > > > > > > > > > > small for the apps in the script, and no apps wind up > > > > > > > > > > running. 
> > > > > > > > > > I > > > > > > > > > > think > > > > > > > > > > that situation in general merits a separate ticket, > > > > > > > > > > and > > > > > > > > > > may > > > > > > > > > > have > > > > > > > > > > been > > > > > > > > > > discussed on swift-devel (but quite a while ago). > > > > > > > > > > > > > > > > > > > > Can you determine if indeed Papia's SWAT runs are > > > > > > > > > > hanging > > > > > > > > > > for > > > > > > > > > > a > > > > > > > > > > reason > > > > > > > > > > other than a Java deadlock? > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > , > > > > > > > > > > > "Papia > > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > The jstack log corresponds to the most recent log > > > > > > > > > > > file - > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. > > > > > > > > > > > jstack does not report any deadlocks, but I thought > > > > > > > > > > > it > > > > > > > > > > > might > > > > > > > > > > > be > > > > > > > > > > > useful > > > > > > > > > > > so I included it. Swift was not making any progress > > > > > > > > > > > for > > > > > > > > > > > about > > > > > > > > > > > 5 > > > > > > > > > > > hours > > > > > > > > > > > before I sent the logs. I am running the latest 0.93 > > > > > > > > > > > branch. > > > > > > > > > > > I > > > > > > > > > > > will > > > > > > > > > > > try again today. 
> > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > From: "Michael Wilde" > > > > > > > > > > > > To: "David Kelly" > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > , > > > > > > > > > > > > "Papia > > > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > David, which of the many Swift logs in that /swat > > > > > > > > > > > > dir > > > > > > > > > > > > does > > > > > > > > > > > > the > > > > > > > > > > > > jstack.log pertain to? How many of these runs > > > > > > > > > > > > deadlocked? > > > > > > > > > > > > > > > > > > > > > > > > And, did you verify that you (and Papia) are > > > > > > > > > > > > running > > > > > > > > > > > > on > > > > > > > > > > > > the > > > > > > > > > > > > latest > > > > > > > > > > > > rev > > > > > > > > > > > > of the 0.93 branch? > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > > > To: "Mihael Hategan" > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > , > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > Rizwan" , "Michael > > > > > > > > > > > > > Wilde" > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 PM > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > > I was able to reproduce the problem with > > > > > > > > > > > > > persistent > > > > > > > > > > > > > coasters > > > > > > > > > > > > > on > > > > > > > > > > > > > the > > > > > > > > > > > > > MCS servers. 
> > > > > > > > > > > > > > > > > > > > > > > > > > The jstack output is at > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log > > > > > > > > > > > > > > > > > > > > > > > > > > The full collection of logs are at > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat. > > > > > > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > From: "Mihael Hategan" > > > > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > > , > > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > > Rizwan" > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 10:30:48 > > > > > > > > > > > > > > PM > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > > > Could you also forward the attachments please? > > > > > > > > > > > > > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael > > > > > > > > > > > > > > Wilde > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > I think I am seeing a similar deadlock on > > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > > in > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > ParVis > > > > > > > > > > > > > > > script, > > > > > > > > > > > > > > > and am trying to get a clean log and jstack > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > confirm. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > As far as I can tell, Papia is running the > > > > > > > > > > > > > > > correct > > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > > code, > > > > > > > > > > > > > > > but > > > > > > > > > > > > > > > please verify. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > David will try to replicate this problem as > > > > > > > > > > > > > > > well. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > > > From: "Papia Rizwan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "swift-devel Devel" > > > > > > > > > > > > > > > > , > > > > > > > > > > > > > > > > "Michael > > > > > > > > > > > > > > > > Wilde" , "Michael P. > > > > > > > > > > > > > > > > Shields" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 > > > > > > > > > > > > > > > > 1:56:13 PM > > > > > > > > > > > > > > > > Subject: swift 0.93 deadlock > > > > > > > > > > > > > > > > Attached are the jstack output and the log > > > > > > > > > > > > > > > > file. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > Papia Rizwan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > > Swift-devel mailing list > > > > > > > > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > Michael Wilde > > > > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Michael Wilde > > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > -- > > > > > > > > Michael Wilde > 
> > > > > > > Computation Institute, University of Chicago
> > > > > > > > Mathematics and Computer Science Division
> > > > > > > > Argonne National Laboratory
> > > > > >
> > > > > > --
> > > > > > Michael Wilde
> > > > > > Computation Institute, University of Chicago
> > > > > > Mathematics and Computer Science Division
> > > > > > Argonne National Laboratory
> > > > > _______________________________________________
> > > > > Swift-devel mailing list
> > > > > Swift-devel at ci.uchicago.edu
> > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > >

From hategan at mcs.anl.gov Tue Sep 20 21:08:46 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 20 Sep 2011 19:08:46 -0700
Subject: [Swift-devel] NPE when cpus register to persistent coaster service
In-Reply-To: <799928578.20388.1316491282825.JavaMail.root@zimbra.anl.gov>
References: <799928578.20388.1316491282825.JavaMail.root@zimbra.anl.gov>
Message-ID: <1316570926.13408.1.camel@blabla>

Trunk or branch?

On Mon, 2011-09-19 at 23:01 -0500, Michael Wilde wrote:
> I'm seeing this error in the service log (in trunk) when workers register CPUs with the service:
>
> (Swift svn swift-r5131 (swift modified locally) cog-r3286)
>
> 2011-09-19 22:52:57,882-0500 INFO AbstractStreamKarajanChannel nullChannel started
> 2011-09-19 22:52:57,882-0500 INFO AbstractStreamKarajanChannel$Multiplexer (0) Scheduling SC-null for addition
> 2011-09-19 22:52:57,882-0500 INFO AbstractStreamKarajanChannel nullChannel started
> 2011-09-19 22:52:57,907-0500 INFO LocalTCPService Received registration: blockid = swork3, url = f1
> 2011-09-19 22:52:57,917-0500 INFO AbstractStreamKarajanChannel$Multiplexer (0) Scheduling SC-null for addition
> 2011-09-19 22:52:57,917-0500 INFO AbstractStreamKarajanChannel nullChannel started
> 2011-09-19 22:52:57,949-0500 INFO MetaChannel MetaChannel: 381531395[1729747990: {}] -> null.bind -> SC-null
> 2011-09-19 22:52:57,952-0500 DEBUG Cpu worker started: block=swork3 host=f1 id=0
> 2011-09-19 22:52:57,952-0500 DEBUG Cpu ready for work: block=swork3 id=0
> 2011-09-19 22:52:57,953-0500 INFO Block Started CPU 0:1316490777s
> 2011-09-19 22:52:57,953-0500 TRACE Cpu swork3:0 pull
> 2011-09-19 22:52:57,953-0500 INFO Block Started worker swork3:000000
> 2011-09-19 22:52:57,953-0500 DEBUG Cpu requesting work: block=swork3 id=0 Cpus sleeping: 1
> 2011-09-19 22:52:57,953-0500 DEBUG Cpu swork3:0 sleeping
> 2011-09-19 22:52:57,954-0500 DEBUG PullThread sleep: 0:1316490777s
> 2011-09-19 22:52:57,955-0500 WARN BlockQueueProcessor Failed to send worker status update to client
> java.lang.NullPointerException
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433)
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226)
>     at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.java:72)
>     at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143)
>     at org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64)
>     at org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57)
>     at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:375)
> 2011-09-19 22:52:57,959-0500 INFO LocalTCPService Received registration: blockid = swork6, url = f1
> 2011-09-19 22:52:57,959-0500 INFO MetaChannel MetaChannel: 558519794[723566380: {}] -> null.bind -> SC-null
> 2011-09-19 22:52:57,959-0500 DEBUG Cpu worker started: block=swork6 host=f1 id=0
> 2011-09-19 22:52:57,959-0500 DEBUG Cpu ready for work: block=swork6 id=0
> 2011-09-19 22:52:57,959-0500 INFO Block Started CPU 0:1316490777s
> 2011-09-19 22:52:57,959-0500 TRACE Cpu swork6:0 pull
> 2011-09-19 22:52:57,960-0500 INFO Block Started worker swork6:000000
> 2011-09-19 22:52:57,960-0500 DEBUG Cpu requesting work: block=swork6 id=0 Cpus sleeping: 2
> 2011-09-19 22:52:57,960-0500 WARN BlockQueueProcessor Failed to send worker status update to client
> java.lang.NullPointerException
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433)
>     at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226)
>     at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.java:72)
>     at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143)
>     at org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64)
>     at org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57)
>     at org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157)
>     at org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:375)
>
> The service still seems to work.
>
> - Mike
>

From hategan at mcs.anl.gov Tue Sep 20 21:15:39 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 20 Sep 2011 19:15:39 -0700
Subject: [Swift-devel] ParVis AMWG script fails to compile on trunk: unexpected character in markup
In-Reply-To: <2021384179.22298.1316538140935.JavaMail.root@zimbra.anl.gov>
References: <2021384179.22298.1316538140935.JavaMail.root@zimbra.anl.gov>
Message-ID: <1316571339.13408.2.camel@blabla>

A consequence of the previous case sensitivity code. Should work now.

On Tue, 2011-09-20 at 12:02 -0500, Michael Wilde wrote:
> Trunk is giving the following error when trying to run the ParVis AMWG script (I'm using -typecheck here but the same failure occurs without -typecheck):
>
> fusion$ swift -typecheck amwg_stats.swift
> no sites file specified, setting to default: /homes/wilde/swift/src/trunk/cog/modules/swift/dist/swift-svn/etc/sites.xml
> Could not start execution:
> Error reading source: : unexpected character in markup - (position: START_TAG seen ...\n <-... @2063:13) : : unexpected character in markup - (position: START_TAG seen ...\n <-... @2063:13) :
> : unexpected character in markup - (position: START_TAG seen ...\n <-... @2063:13) : : unexpected character in markup - (position: START_TAG seen ...\n <-... @2063:13) :
> : unexpected character in markup - (position: START_TAG seen ...\n <-... @2063:13) :
> unexpected character in markup - (position: START_TAG seen ...\n <-... @2063:13)
>
>
> 0.93 seems to compile it OK, but note what looks like a bug: -typecheck doesn't stop swift from trying to execute the script:
>
>
> fusion$ ~/swift/src/0.93/cog/modules/swift/dist/swift-svn/bin/swift -typecheck amwg_stats.swift
> no sites file specified, setting to default: /homes/wilde/swift/src/trunk/cog/modules/swift/dist/swift-svn/etc/sites.xml
> Swift svn swift-r5105 cog-r3262
>
> RunID: 20110920-1157-art6r730
> Progress: time: Tue, 20 Sep 2011 11:57:56 -0500
> Execution failed:
> Missing command line argument: test_case
> SwiftScript trace: cntl_case: dummy
> fusion$
>
>

From wilde at mcs.anl.gov Tue Sep 20 23:01:28 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 20 Sep 2011 23:01:28 -0500 (CDT)
Subject: [Swift-devel] NPE when cpus register to persistent coaster service
In-Reply-To: <1316570926.13408.1.camel@blabla>
Message-ID: <1629880930.24289.1316577688508.JavaMail.root@zimbra.anl.gov>

trunk.

----- Original Message -----
> From: "Mihael Hategan"
> To: "Michael Wilde"
> Cc: "Swift Devel"
> Sent: Tuesday, September 20, 2011 9:08:46 PM
> Subject: Re: NPE when cpus register to persistent coaster service
> Trunk or branch?
> > On Mon, 2011-09-19 at 23:01 -0500, Michael Wilde wrote: > > Im seeing this error in the service log (in trunk) when workers > > register CPUs with the service: > > > > (Swift svn swift-r5131 (swift modified locally) cog-r3286) > > > > 2011-09-19 22:52:57,882-0500 INFO AbstractStreamKarajanChannel > > nullChannel started > > 2011-09-19 22:52:57,882-0500 INFO > > AbstractStreamKarajanChannel$Multiplexer (0) Scheduling SC-null for > > addition > > 2011-09-19 22:52:57,882-0500 INFO AbstractStreamKarajanChannel > > nullChannel started > > 2011-09-19 22:52:57,907-0500 INFO LocalTCPService Received > > registration: blockid = swork3, url = f1 > > 2011-09-19 22:52:57,917-0500 INFO > > AbstractStreamKarajanChannel$Multiplexer (0) Scheduling SC-null for > > addition > > 2011-09-19 22:52:57,917-0500 INFO AbstractStreamKarajanChannel > > nullChannel started > > 2011-09-19 22:52:57,949-0500 INFO MetaChannel MetaChannel: > > 381531395[1729747990: {}] -> null.bind -> SC-null > > 2011-09-19 22:52:57,952-0500 DEBUG Cpu worker started: block=swork3 > > host=f1 id=0 > > 2011-09-19 22:52:57,952-0500 DEBUG Cpu ready for work: block=swork3 > > id=0 > > 2011-09-19 22:52:57,953-0500 INFO Block Started CPU 0:1316490777s > > 2011-09-19 22:52:57,953-0500 TRACE Cpu swork3:0 pull > > 2011-09-19 22:52:57,953-0500 INFO Block Started worker swork3:000000 > > 2011-09-19 22:52:57,953-0500 DEBUG Cpu requesting work: block=swork3 > > id=0 Cpus sleeping: 1 > > 2011-09-19 22:52:57,953-0500 DEBUG Cpu swork3:0 sleeping > > 2011-09-19 22:52:57,954-0500 DEBUG PullThread sleep: 0:1316490777s > > 2011-09-19 22:52:57,955-0500 WARN BlockQueueProcessor Failed to send > > worker status update to client > > java.lang.NullPointerException > > at > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433) > > at > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > at > > 
org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.j > > ava:72) > > at > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143) > > at > > org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64) > > at > > org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57) > > at > > org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChanne > > l.java:375) > > 2011-09-19 22:52:57,959-0500 INFO LocalTCPService Received > > registration: blockid = swork6, url = f1 > > 2011-09-19 22:52:57,959-0500 INFO MetaChannel MetaChannel: > > 558519794[723566380: {}] -> null.bind -> SC-null > > 2011-09-19 22:52:57,959-0500 DEBUG Cpu worker started: block=swork6 > > host=f1 id=0 > > 2011-09-19 22:52:57,959-0500 DEBUG Cpu ready for work: block=swork6 > > id=0 > > 2011-09-19 22:52:57,959-0500 INFO Block Started CPU 0:1316490777s > > 2011-09-19 22:52:57,959-0500 TRACE Cpu swork6:0 pull > > 2011-09-19 22:52:57,960-0500 INFO Block Started worker swork6:000000 > > 2011-09-19 22:52:57,960-0500 DEBUG Cpu requesting work: block=swork6 > > id=0 Cpus sleeping: 2 > > 2011-09-19 22:52:57,960-0500 WARN BlockQueueProcessor Failed to send > > worker status update to client > > java.lang.NullPointerException > > at > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.getMetaChannel(ChannelManager.java:433) > > at > > 
org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > at > > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.registrationReceived(PassiveQueueProcessor.j > > ava:72) > > at > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.registrationReceived(JobQueue.java:143) > > at > > org.globus.cog.abstraction.coaster.service.LocalTCPService.registrationReceived(LocalTCPService.java:64) > > at > > org.globus.cog.abstraction.coaster.service.local.RegistrationHandler.requestComplete(RegistrationHandler.java:57) > > at > > org.globus.cog.karajan.workflow.service.handlers.RequestHandler.receiveCompleted(RequestHandler.java:84) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleRequest(AbstractKarajanChannel.java:416) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:157) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChanne > > l.java:375) > > > > The service still seems to work. > > > > - Mike > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Wed Sep 21 02:08:58 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 21 Sep 2011 00:08:58 -0700 Subject: [Swift-devel] swift 0.93 deadlock In-Reply-To: <959559750.111934.1316488816675.JavaMail.root@zimbra-mb2.anl.gov> References: <959559750.111934.1316488816675.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: <1316588938.22874.0.camel@blabla> Fix in r5143. Please test. On Mon, 2011-09-19 at 22:20 -0500, David Kelly wrote: > I tried today with the 0.93 update. It ran for approximately 7 hours before freezing. It looks to be happening in a different place this time. 
> > http://www.ci.uchicago.edu/~davidk/swat4/jstack.log > http://www.ci.uchicago.edu/~davidk/swat4/cce_ua-20110919-1955-h7t8iui2.log > > David > > > ----- Original Message ----- > > From: "Mihael Hategan" > > To: "Michael Wilde" > > Cc: "David Kelly" , "swift-devel Devel" , "Papia Rizwan" > > > > Sent: Saturday, September 17, 2011 11:36:25 PM > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > I have a tentative fix in the branch and trunk. Revisions 5123 and > > 5124, > > respectively. Please let me know how that works out. > > > > Mihae > > > > On Fri, 2011-09-16 at 11:50 -0500, Michael Wilde wrote: > > > David and Papia, can you report to the list what the status is of > > > running the SWAT app? > > > > > > - I understand that Mihael will work on the 0.93 deadlock fix this > > > weekend, which is great. > > > > > > - I understand that its happening on trunk as well > > > > > > - Papia, can you try to "perturb" the Swift code in the hopes that > > > some equivalent but different code doesnt trip into the same bug? Ie > > > try a different mapper, different variable strategy (ie arrays vs > > > scalars, structs vs separate vars) just to see if you can work > > > around this? Or, put in some shell logic to catch the hang and kill > > > and re-run (or resume) Swift? if you just kill a hung script and > > > then resume it, will it work? We could maybe alter the hang checker > > > to kill swift on its own, with a return code or message that you > > > could use to trigger a resume. > > > > > > Mike > > > > > > > > > ----- Original Message ----- > > > > From: "David Kelly" > > > > To: "Michael Wilde" > > > > Cc: "swift-devel Devel" , "Papia > > > > Rizwan" > > > > Sent: Thursday, September 15, 2011 4:34:02 PM > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > I was able to get it running on PADS with trunk. I ran into the > > > > same > > > > issue. 
> > > > > > > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log > > > > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log > > > > > > > > David > > > > > > > > ----- Original Message ----- > > > > > From: "David Kelly" > > > > > To: "Michael Wilde" > > > > > Cc: "swift-devel Devel" , "Papia > > > > > Rizwan" > > > > > Sent: Thursday, September 15, 2011 2:39:47 PM > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using > > > > > passive > > > > > persistent coasters. Is there a way to use automatic coasters on > > > > > the > > > > > MCS workstations? I'll try copying this over to PADS and running > > > > > there > > > > > to see if I can reproduce it. > > > > > > > > > > David > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Michael Wilde" > > > > > > To: "David Kelly" > > > > > > Cc: "swift-devel Devel" , "Papia > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > Sent: Thursday, September 15, 2011 2:18:17 PM > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > Can you make SWAT run under trunk, as Papia is testing using > > > > > > standard > > > > > > auto coasters, and doesn't need any of the missing > > > > > > coaster-service > > > > > > options? > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "David Kelly" > > > > > > > To: "Michael Wilde" > > > > > > > Cc: "swift-devel Devel" , > > > > > > > "Papia > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 2:15:36 PM > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > I got past the compilation errors by renaming all the > > > > > > > functions > > > > > > > with > > > > > > > capitalization, but ran into an issue with coaster-service. 
> > > > > > > Last > > > > > > > week > > > > > > > I noticed coaster-service was missing options for dynamic > > > > > > > ports. > > > > > > > I > > > > > > > found today that it is also missing -passive. I'll try to > > > > > > > track > > > > > > > down > > > > > > > where this changed and restore the previous version. > > > > > > > > > > > > > > David > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > From: "Michael Wilde" > > > > > > > > To: "David Kelly" > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > "Papia > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > Excellent, thanks - that's good. I also just verified that > > > > > > > > Papia > > > > > > > > is > > > > > > > > not > > > > > > > > using the overAllocation tags in the sites file, so this > > > > > > > > problem > > > > > > > > is > > > > > > > > clearly a Java deadlock and has nothing to do with the > > > > > > > > scheduling > > > > > > > > problem that the (now fixed) overAllocation problem was > > > > > > > > causing. > > > > > > > > > > > > > > > > My understanding is that this SWAT script is failing under > > > > > > > > trunk > > > > > > > > because of the recent token case handling issue (I think > > > > > > > > the > > > > > > > > camel-case one). Can you work with Papia to see if either > > > > > > > > that > > > > > > > > issue > > > > > > > > is now fixed, or if her script can be changed to avoid > > > > > > > > that, > > > > > > > > so > > > > > > > > that > > > > > > > > you can both test the SWAT script with trunk, to see if > > > > > > > > the > > > > > > > > deadlock > > > > > > > > still occurs? 
> > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > From: "David Kelly" > > > > > > > > > To: "Michael Wilde" > > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > > "Papia > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > I narrowed down the problem a bit. Last night I ran > > > > > > > > > jstack > > > > > > > > > on > > > > > > > > > the > > > > > > > > > wrong Java process, which is why it didn't report a > > > > > > > > > deadlock. > > > > > > > > > > > > > > > > > > Papia and I are seeing the same issue. > > > > > > > > > > > > > > > > > > My jstack: > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/jstack.log > > > > > > > > > Papia's jstack: > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log > > > > > > > > > > > > > > > > > > It happens in the same place: > > > > > > > > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100) > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24) > > > > > > > > > > > > > > > > > > Filed as bug #559 > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > From: "Michael Wilde" > > > > > > > > > > To: "David Kelly" > > > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > > > "Papia > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > David, it sounds like more analysis is needed here. 
If > > > > > > > > > > the > > > > > > > > > > SWAT > > > > > > > > > > runs > > > > > > > > > > are not showing a deadlock (but your runs are) then > > > > > > > > > > likely > > > > > > > > > > we > > > > > > > > > > have > > > > > > > > > > two > > > > > > > > > > different problems here. > > > > > > > > > > > > > > > > > > > > Another case we saw in 0.93 with scripts failing to > > > > > > > > > > progress > > > > > > > > > > is > > > > > > > > > > due > > > > > > > > > > to > > > > > > > > > > the overAllocation parameter problem that Mihael fixed > > > > > > > > > > yesterday. > > > > > > > > > > The > > > > > > > > > > symptom there is that Swift starts a coaster with a > > > > > > > > > > time > > > > > > > > > > slot > > > > > > > > > > too > > > > > > > > > > small for the apps in the script, and no apps wind up > > > > > > > > > > running. > > > > > > > > > > I > > > > > > > > > > think > > > > > > > > > > that situation in general merits a separate ticket, > > > > > > > > > > and > > > > > > > > > > may > > > > > > > > > > have > > > > > > > > > > been > > > > > > > > > > discussed on swift-devel (but quite a while ago). > > > > > > > > > > > > > > > > > > > > Can you determine if indeed Papia's SWAT runs are > > > > > > > > > > hanging > > > > > > > > > > for > > > > > > > > > > a > > > > > > > > > > reason > > > > > > > > > > other than a Java deadlock? 
> > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > , > > > > > > > > > > > "Papia > > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > The jstack log corresponds to the most recent log > > > > > > > > > > > file - > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. > > > > > > > > > > > jstack does not report any deadlocks, but I thought > > > > > > > > > > > it > > > > > > > > > > > might > > > > > > > > > > > be > > > > > > > > > > > useful > > > > > > > > > > > so I included it. Swift was not making any progress > > > > > > > > > > > for > > > > > > > > > > > about > > > > > > > > > > > 5 > > > > > > > > > > > hours > > > > > > > > > > > before I sent the logs. I am running the latest 0.93 > > > > > > > > > > > branch. > > > > > > > > > > > I > > > > > > > > > > > will > > > > > > > > > > > try again today. 
> > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > From: "Michael Wilde" > > > > > > > > > > > > To: "David Kelly" > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > , > > > > > > > > > > > > "Papia > > > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > David, which of the many Swift logs in that /swat > > > > > > > > > > > > dir > > > > > > > > > > > > does > > > > > > > > > > > > the > > > > > > > > > > > > jstack.log pertain to? How many of these runs > > > > > > > > > > > > deadlocked? > > > > > > > > > > > > > > > > > > > > > > > > And, did you verify that you (and Papia) are > > > > > > > > > > > > running > > > > > > > > > > > > on > > > > > > > > > > > > the > > > > > > > > > > > > latest > > > > > > > > > > > > rev > > > > > > > > > > > > of the 0.93 branch? > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > > > To: "Mihael Hategan" > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > , > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > Rizwan" , "Michael > > > > > > > > > > > > > Wilde" > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 PM > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > > I was able to reproduce the problem with > > > > > > > > > > > > > persistent > > > > > > > > > > > > > coasters > > > > > > > > > > > > > on > > > > > > > > > > > > > the > > > > > > > > > > > > > MCS servers. 
> > > > > > > > > > > > > > > > > > > > > > > > > > The jstack output is at > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log > > > > > > > > > > > > > > > > > > > > > > > > > > The full collection of logs are at > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat. > > > > > > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > From: "Mihael Hategan" > > > > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > > , > > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > > Rizwan" > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 10:30:48 > > > > > > > > > > > > > > PM > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > > > Could you also forward the attachments please? > > > > > > > > > > > > > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael > > > > > > > > > > > > > > Wilde > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > I think I am seeing a similar deadlock on > > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > > in > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > ParVis > > > > > > > > > > > > > > > script, > > > > > > > > > > > > > > > and am trying to get a clean log and jstack > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > confirm. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > As far as I can tell, Papia is running the > > > > > > > > > > > > > > > correct > > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > > code, > > > > > > > > > > > > > > > but > > > > > > > > > > > > > > > please verify. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > David will try to replicate this problem as > > > > > > > > > > > > > > > well. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > > > From: "Papia Rizwan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "swift-devel Devel" > > > > > > > > > > > > > > > > , > > > > > > > > > > > > > > > > "Michael > > > > > > > > > > > > > > > > Wilde" , "Michael P. > > > > > > > > > > > > > > > > Shields" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 > > > > > > > > > > > > > > > > 1:56:13 PM > > > > > > > > > > > > > > > > Subject: swift 0.93 deadlock > > > > > > > > > > > > > > > > Attached are the jstack output and the log > > > > > > > > > > > > > > > > file. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > Papia Rizwan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > > Swift-devel mailing list > > > > > > > > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > Michael Wilde > > > > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Michael Wilde > > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > -- > > > > > > > > Michael Wilde > 
> > > > > > > Computation Institute, University of Chicago > > > > > > > > Mathematics and Computer Science Division > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > -- > > > > > > Michael Wilde > > > > > > Computation Institute, University of Chicago > > > > > > Mathematics and Computer Science Division > > > > > > Argonne National Laboratory > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > From dsk at ci.uchicago.edu Wed Sep 21 02:31:27 2011 From: dsk at ci.uchicago.edu (Daniel S. Katz) Date: Wed, 21 Sep 2011 09:31:27 +0200 Subject: [Swift-devel] RPC and futures Message-ID: <4E509B3F-451F-4D4A-948A-A64F5B3E2715@ci.uchicago.edu> came across this... http://engineering.twitter.com/2011/08/finagle-protocol-agnostic-rpc-system.html -- Daniel S. Katz University of Chicago (773) 834-7186 (voice) (773) 834-6818 (fax) d.katz at ieee.org or dsk at ci.uchicago.edu http://www.ci.uchicago.edu/~dsk/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Wed Sep 21 08:39:28 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 21 Sep 2011 08:39:28 -0500 (CDT) Subject: [Swift-devel] swift 0.93 deadlock In-Reply-To: <1316588938.22874.0.camel@blabla> Message-ID: <64608070.24724.1316612368471.JavaMail.root@zimbra.anl.gov> Thanks, Mihael. Is this the same deadlock as the one I reported recently for 0.93? Sent: Tuesday, September 20, 2011 5:34:50 PM Subject: [Swift-devel] New 0.93 deadlock - mapping.RootArrayDataNode? - Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "David Kelly" > Cc: "swift-devel Devel" , "Papia Rizwan" , "Michael Wilde" > > Sent: Wednesday, September 21, 2011 2:08:58 AM > Subject: Re: [Swift-devel] swift 0.93 deadlock > Fix in r5143. Please test. 
> > On Mon, 2011-09-19 at 22:20 -0500, David Kelly wrote: > > I tried today with the 0.93 update. It ran for approximately 7 hours > > before freezing. It looks to be happening in a different place this > > time. > > > > http://www.ci.uchicago.edu/~davidk/swat4/jstack.log > > http://www.ci.uchicago.edu/~davidk/swat4/cce_ua-20110919-1955-h7t8iui2.log > > > > David > > > > > > ----- Original Message ----- > > > From: "Mihael Hategan" > > > To: "Michael Wilde" > > > Cc: "David Kelly" , "swift-devel Devel" > > > , "Papia Rizwan" > > > > > > Sent: Saturday, September 17, 2011 11:36:25 PM > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > I have a tentative fix in the branch and trunk. Revisions 5123 and > > > 5124, > > > respectively. Please let me know how that works out. > > > > > > Mihae > > > > > > On Fri, 2011-09-16 at 11:50 -0500, Michael Wilde wrote: > > > > David and Papia, can you report to the list what the status is > > > > of > > > > running the SWAT app? > > > > > > > > - I understand that Mihael will work on the 0.93 deadlock fix > > > > this > > > > weekend, which is great. > > > > > > > > - I understand that its happening on trunk as well > > > > > > > > - Papia, can you try to "perturb" the Swift code in the hopes > > > > that > > > > some equivalent but different code doesnt trip into the same > > > > bug? Ie > > > > try a different mapper, different variable strategy (ie arrays > > > > vs > > > > scalars, structs vs separate vars) just to see if you can work > > > > around this? Or, put in some shell logic to catch the hang and > > > > kill > > > > and re-run (or resume) Swift? if you just kill a hung script and > > > > then resume it, will it work? We could maybe alter the hang > > > > checker > > > > to kill swift on its own, with a return code or message that you > > > > could use to trigger a resume. 
> > > > > > > > Mike > > > > > > > > > > > > ----- Original Message ----- > > > > > From: "David Kelly" > > > > > To: "Michael Wilde" > > > > > Cc: "swift-devel Devel" , "Papia > > > > > Rizwan" > > > > > Sent: Thursday, September 15, 2011 4:34:02 PM > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > I was able to get it running on PADS with trunk. I ran into > > > > > the > > > > > same > > > > > issue. > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log > > > > > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log > > > > > > > > > > David > > > > > > > > > > ----- Original Message ----- > > > > > > From: "David Kelly" > > > > > > To: "Michael Wilde" > > > > > > Cc: "swift-devel Devel" , > > > > > > "Papia > > > > > > Rizwan" > > > > > > Sent: Thursday, September 15, 2011 2:39:47 PM > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using > > > > > > passive > > > > > > persistent coasters. Is there a way to use automatic > > > > > > coasters on > > > > > > the > > > > > > MCS workstations? I'll try copying this over to PADS and > > > > > > running > > > > > > there > > > > > > to see if I can reproduce it. > > > > > > > > > > > > David > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "Michael Wilde" > > > > > > > To: "David Kelly" > > > > > > > Cc: "swift-devel Devel" , > > > > > > > "Papia > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 2:18:17 PM > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > Can you make SWAT run under trunk, as Papia is testing > > > > > > > using > > > > > > > standard > > > > > > > auto coasters, and doesnt need any of the missing > > > > > > > coaster-service > > > > > > > options. 
> > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > From: "David Kelly" > > > > > > > > To: "Michael Wilde" > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > "Papia > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 2:15:36 PM > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > I got past the compilation errors by renaming the all > > > > > > > > functions > > > > > > > > with > > > > > > > > capitalization, but ran into an issue with > > > > > > > > coaster-service. > > > > > > > > Last > > > > > > > > week > > > > > > > > I noticed coaster-service was missing options for > > > > > > > > dynamic > > > > > > > > ports. > > > > > > > > I > > > > > > > > found today that it is also missing -passive. I'll try > > > > > > > > to > > > > > > > > track > > > > > > > > down > > > > > > > > where this changed and restore the previous version. > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > From: "Michael Wilde" > > > > > > > > > To: "David Kelly" > > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > > "Papia > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > Excellent, thanks - thats good. I also just verified > > > > > > > > > that > > > > > > > > > Papia > > > > > > > > > is > > > > > > > > > not > > > > > > > > > using the overAllocation tags in the sites file, so > > > > > > > > > this > > > > > > > > > problem > > > > > > > > > is > > > > > > > > > clearly a Java deadlock and has nothing to do with the > > > > > > > > > scheduling > > > > > > > > > problem that the (now fixed) overAllocation problem > > > > > > > > > was > > > > > > > > > causing.. 
> > > > > > > > > > > > > > > > > > My understanding is that this SWAT script is failing > > > > > > > > > under > > > > > > > > > trunk > > > > > > > > > because of the recent token case handling issue (I > > > > > > > > > think > > > > > > > > > the > > > > > > > > > camel-case one). Can you work with Papia to see if > > > > > > > > > either > > > > > > > > > that > > > > > > > > > issue > > > > > > > > > is now fixed, or if her script can be changed to avoid > > > > > > > > > that, > > > > > > > > > so > > > > > > > > > that > > > > > > > > > you can both test the SWAT script with trunk, to see > > > > > > > > > if > > > > > > > > > the > > > > > > > > > deadlock > > > > > > > > > still occurs? > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > - MIke > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > From: "David Kelly" > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > , > > > > > > > > > > "Papia > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > I narrowed down the problem a bit. Last night I ran > > > > > > > > > > jstack > > > > > > > > > > on > > > > > > > > > > the > > > > > > > > > > wrong java process which is why it didn't report a > > > > > > > > > > deadlock. > > > > > > > > > > > > > > > > > > > > Papia and I are seeing the same issue. 
> > > > > > > > > > > > > > > > > > > > My jstack: > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/jstack.log > > > > > > > > > > Papia's jstack: > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log > > > > > > > > > > > > > > > > > > > > It happens in the same place: > > > > > > > > > > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100) > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24) > > > > > > > > > > > > > > > > > > > > Filed as bug #559 > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > From: "Michael Wilde" > > > > > > > > > > > To: "David Kelly" > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > , > > > > > > > > > > > "Papia > > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > David, it sounds like more analysis is needed > > > > > > > > > > > here. If > > > > > > > > > > > the > > > > > > > > > > > SWAT > > > > > > > > > > > runs > > > > > > > > > > > are not showing a deadlock (but your runs are) > > > > > > > > > > > then > > > > > > > > > > > likely > > > > > > > > > > > we > > > > > > > > > > > have > > > > > > > > > > > two > > > > > > > > > > > different problems here. > > > > > > > > > > > > > > > > > > > > > > Another case we saw in 0.93 with scripts failing > > > > > > > > > > > to > > > > > > > > > > > progress > > > > > > > > > > > is > > > > > > > > > > > due > > > > > > > > > > > to > > > > > > > > > > > the overAllocation parameter problem that Mihael > > > > > > > > > > > fixed > > > > > > > > > > > yesterday. 
> > > > > > > > > > > The > > > > > > > > > > > symptom there is that Swift starts a coaster with > > > > > > > > > > > a > > > > > > > > > > > time > > > > > > > > > > > slot > > > > > > > > > > > too > > > > > > > > > > > small for the apps in the script, and no apps wind > > > > > > > > > > > up > > > > > > > > > > > running. > > > > > > > > > > > I > > > > > > > > > > > think > > > > > > > > > > > that situation in general merits a separate > > > > > > > > > > > ticket, > > > > > > > > > > > and > > > > > > > > > > > may > > > > > > > > > > > have > > > > > > > > > > > been > > > > > > > > > > > discussed on swift-devel (but quite a while ago). > > > > > > > > > > > > > > > > > > > > > > Can you determine if indeed Papia's SWAT runs are > > > > > > > > > > > hanging > > > > > > > > > > > for > > > > > > > > > > > a > > > > > > > > > > > reason > > > > > > > > > > > other than a Java deadlock? > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > , > > > > > > > > > > > > "Papia > > > > > > > > > > > > Rizwan" , "Mihael > > > > > > > > > > > > Hategan" > > > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > The jstack log corresponds to the most recent > > > > > > > > > > > > log > > > > > > > > > > > > file - > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. > > > > > > > > > > > > jstack does not report any deadlocks, but I > > > > > > > > > > > > thought > > > > > > > > > > > > it > > > > > > > > > > > > might > > > > > > > > > > > > be > > > > > > > > > > > > useful > > > > > > > > > > > > so I included it. 
Swift was not making any > > > > > > > > > > > > progress > > > > > > > > > > > > for > > > > > > > > > > > > about > > > > > > > > > > > > 5 > > > > > > > > > > > > hours > > > > > > > > > > > > before I sent the logs. I am running the latest > > > > > > > > > > > > 0.93 > > > > > > > > > > > > branch. > > > > > > > > > > > > I > > > > > > > > > > > > will > > > > > > > > > > > > try again today. > > > > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > From: "Michael Wilde" > > > > > > > > > > > > > To: "David Kelly" > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > , > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > Rizwan" , "Mihael > > > > > > > > > > > > > Hategan" > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > > David, which of the many Swift logs in that > > > > > > > > > > > > > /swat > > > > > > > > > > > > > dir > > > > > > > > > > > > > does > > > > > > > > > > > > > the > > > > > > > > > > > > > jstack.log pertain to? How many of these runs > > > > > > > > > > > > > deadlocked? > > > > > > > > > > > > > > > > > > > > > > > > > > And, did you verify that you (and Papia) are > > > > > > > > > > > > > running > > > > > > > > > > > > > on > > > > > > > > > > > > > the > > > > > > > > > > > > > latest > > > > > > > > > > > > > rev > > > > > > > > > > > > > of the 0.93 branch? 
> > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > > > > To: "Mihael Hategan" > > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > > , > > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > > Rizwan" , "Michael > > > > > > > > > > > > > > Wilde" > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 > > > > > > > > > > > > > > PM > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 > > > > > > > > > > > > > > deadlock > > > > > > > > > > > > > > I was able to reproduce the problem with > > > > > > > > > > > > > > persistent > > > > > > > > > > > > > > coasters > > > > > > > > > > > > > > on > > > > > > > > > > > > > > the > > > > > > > > > > > > > > MCS servers. > > > > > > > > > > > > > > > > > > > > > > > > > > > > The jstack output is at > > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log > > > > > > > > > > > > > > > > > > > > > > > > > > > > The full collection of logs are at > > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > > From: "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > > > , > > > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > > > Rizwan" > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 > > > > > > > > > > > > > > > 10:30:48 > > > > > > > > > > > > > > > PM > > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 > > > > > > > > > > > > > > > deadlock > > > > > > > > > > > > > > > Could you also forward the attachments > > > > > > > > > > > > > > > please? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael > > > > > > > > > > > > > > > Wilde > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > I think I am seeing a similar deadlock > > > > > > > > > > > > > > > > on > > > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > > > in > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > ParVis > > > > > > > > > > > > > > > > script, > > > > > > > > > > > > > > > > and am trying to get a clean log and > > > > > > > > > > > > > > > > jstack > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > confirm. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > As far as I can tell, Papia is running > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > correct > > > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > > > code, > > > > > > > > > > > > > > > > but > > > > > > > > > > > > > > > > please verify. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > David will try to replicate this problem > > > > > > > > > > > > > > > > as > > > > > > > > > > > > > > > > well. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > > > > From: "Papia Rizwan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "swift-devel Devel" > > > > > > > > > > > > > > > > > , > > > > > > > > > > > > > > > > > "Michael > > > > > > > > > > > > > > > > > Wilde" , "Michael > > > > > > > > > > > > > > > > > P. > > > > > > > > > > > > > > > > > Shields" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 > > > > > > > > > > > > > > > > > 1:56:13 PM > > > > > > > > > > > > > > > > > Subject: swift 0.93 deadlock > > > > > > > > > > > > > > > > > Attached are the jstack output and the > > > > > > > > > > > > > > > > > log > > > > > > > > > > > > > > > > > file. 
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Papia Rizwan
> > > > > > > > > > > > > > > _______________________________________________
> > > > > > > > > > > > > > > Swift-devel mailing list
> > > > > > > > > > > > > > > Swift-devel at ci.uchicago.edu
> > > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Michael Wilde
> > > > > > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > > > > > Argonne National Laboratory
> > > > > > > > > > > --
> > > > > > > > > > > Michael Wilde
> > > > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > > > Argonne National Laboratory
> > > > > > > > > --
> > > > > > > > > Michael Wilde
> > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > Argonne National Laboratory
> > > > > > > --
> > > > > > > Michael Wilde
> > > > > > > Computation Institute, University of Chicago
> > > > > > > Mathematics and Computer Science Division
> > > > > > > Argonne National Laboratory
> > > > > > _______________________________________________
> > > > > > Swift-devel mailing list
> > > > > > Swift-devel at ci.uchicago.edu
> > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Wed Sep 21 13:09:02 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 21 Sep 2011 11:09:02 -0700
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <64608070.24724.1316612368471.JavaMail.root@zimbra.anl.gov>
References: <64608070.24724.1316612368471.JavaMail.root@zimbra.anl.gov>
Message-ID: <1316628542.24547.0.camel@blabla>

Not quite. But somewhat close.

On Wed, 2011-09-21 at 08:39 -0500, Michael Wilde wrote:
> Thanks, Mihael. Is this the same deadlock as the one I reported recently for 0.93?
>
> Sent: Tuesday, September 20, 2011 5:34:50 PM
> Subject: [Swift-devel] New 0.93 deadlock - mapping.RootArrayDataNode?
>
> - Mike
> ----- Original Message -----
> > From: "Mihael Hategan"
> > To: "David Kelly"
> > Cc: "swift-devel Devel", "Papia Rizwan", "Michael Wilde"
> > Sent: Wednesday, September 21, 2011 2:08:58 AM
> > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > Fix in r5143. Please test.
> >
> > On Mon, 2011-09-19 at 22:20 -0500, David Kelly wrote:
> > > I tried today with the 0.93 update. It ran for approximately 7 hours before freezing. It looks to be happening in a different place this time.
> > >
> > > http://www.ci.uchicago.edu/~davidk/swat4/jstack.log
> > > http://www.ci.uchicago.edu/~davidk/swat4/cce_ua-20110919-1955-h7t8iui2.log
> > >
> > > David
> > >
> > > ----- Original Message -----
> > > > From: "Mihael Hategan"
> > > > To: "Michael Wilde"
> > > > Cc: "David Kelly", "swift-devel Devel", "Papia Rizwan"
> > > > Sent: Saturday, September 17, 2011 11:36:25 PM
> > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > I have a tentative fix in the branch and trunk. Revisions 5123 and 5124, respectively. Please let me know how that works out.
> > > >
> > > > Mihael
> > > >
> > > > On Fri, 2011-09-16 at 11:50 -0500, Michael Wilde wrote:
> > > > > David and Papia, can you report to the list what the status is of running the SWAT app?
> > > > >
> > > > > - I understand that Mihael will work on the 0.93 deadlock fix this weekend, which is great.
> > > > >
> > > > > - I understand that it's happening on trunk as well
> > > > >
> > > > > - Papia, can you try to "perturb" the Swift code in the hopes that some equivalent but different code doesn't trip into the same bug? I.e., try a different mapper, different variable strategy (i.e., arrays vs scalars, structs vs separate vars) just to see if you can work around this? Or, put in some shell logic to catch the hang and kill and re-run (or resume) Swift? If you just kill a hung script and then resume it, will it work? We could maybe alter the hang checker to kill swift on its own, with a return code or message that you could use to trigger a resume.
> > > > >
> > > > > Mike
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "David Kelly"
> > > > > > To: "Michael Wilde"
> > > > > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > > > > Sent: Thursday, September 15, 2011 4:34:02 PM
> > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > I was able to get it running on PADS with trunk. I ran into the same issue.
> > > > > >
> > > > > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log
> > > > > > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log
> > > > > >
> > > > > > David
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "David Kelly"
> > > > > > > To: "Michael Wilde"
> > > > > > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > > > > > Sent: Thursday, September 15, 2011 2:39:47 PM
> > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using passive persistent coasters. Is there a way to use automatic coasters on the MCS workstations? I'll try copying this over to PADS and running there to see if I can reproduce it.
> > > > > > >
> > > > > > > David
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > > From: "Michael Wilde"
> > > > > > > > To: "David Kelly"
> > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > Sent: Thursday, September 15, 2011 2:18:17 PM
> > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > Can you make SWAT run under trunk, as Papia is testing using standard auto coasters, and doesn't need any of the missing coaster-service options?
> > > > > > > >
> > > > > > > > - Mike
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > > From: "David Kelly"
> > > > > > > > > To: "Michael Wilde"
> > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > Sent: Thursday, September 15, 2011 2:15:36 PM
> > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > I got past the compilation errors by renaming all the functions with capitalization, but ran into an issue with coaster-service. Last week I noticed coaster-service was missing options for dynamic ports. I found today that it is also missing -passive. I'll try to track down where this changed and restore the previous version.
> > > > > > > > >
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > To: "David Kelly"
> > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM
> > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > Excellent, thanks - that's good.
> > > > > > > > > > I also just verified that Papia is not using the overAllocation tags in the sites file, so this problem is clearly a Java deadlock and has nothing to do with the scheduling problem that the (now fixed) overAllocation problem was causing.
> > > > > > > > > >
> > > > > > > > > > My understanding is that this SWAT script is failing under trunk because of the recent token case handling issue (I think the camel-case one). Can you work with Papia to see if either that issue is now fixed, or if her script can be changed to avoid that, so that you can both test the SWAT script with trunk, to see if the deadlock still occurs?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > - Mike
> > > > > > > > > >
> > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM
> > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > I narrowed down the problem a bit.
> > > > > > > > > > > Last night I ran jstack on the wrong java process which is why it didn't report a deadlock.
> > > > > > > > > > >
> > > > > > > > > > > Papia and I are seeing the same issue.
> > > > > > > > > > >
> > > > > > > > > > > My jstack:
> > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/jstack.log
> > > > > > > > > > > Papia's jstack:
> > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log
> > > > > > > > > > >
> > > > > > > > > > > It happens in the same place:
> > > > > > > > > > >
> > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100)
> > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24)
> > > > > > > > > > >
> > > > > > > > > > > Filed as bug #559
> > > > > > > > > > >
> > > > > > > > > > > David
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > > > To: "David Kelly"
> > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM
> > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > David, it sounds like more analysis is needed here. If the SWAT runs are not showing a deadlock (but your runs are) then likely we have two different problems here.
> > > > > > > > > > > >
> > > > > > > > > > > > Another case we saw in 0.93 with scripts failing to progress is due to the overAllocation parameter problem that Mihael fixed yesterday. The symptom there is that Swift starts a coaster with a time slot too small for the apps in the script, and no apps wind up running. I think that situation in general merits a separate ticket, and may have been discussed on swift-devel (but quite a while ago).
> > > > > > > > > > > >
> > > > > > > > > > > > Can you determine if indeed Papia's SWAT runs are hanging for a reason other than a Java deadlock?
> > > > > > > > > > > >
> > > > > > > > > > > > - Mike
> > > > > > > > > > > >
> > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM
> > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > The jstack log corresponds to the most recent log file - http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. jstack does not report any deadlocks, but I thought it might be useful so I included it. Swift was not making any progress for about 5 hours before I sent the logs. I am running the latest 0.93 branch. I will try again today.
> > > > > > > > > > > > >
> > > > > > > > > > > > > David
> > > > > > > > > > > > >
> > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > > > > > To: "David Kelly"
> > > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM
> > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > David, which of the many Swift logs in that /swat dir does the jstack.log pertain to? How many of these runs deadlocked?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > And, did you verify that you (and Papia) are running on the latest rev of the 0.93 branch?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - Mike
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > > > > > To: "Mihael Hategan"
> > > > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Michael Wilde"
> > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 PM
> > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > > I was able to reproduce the problem with persistent coasters on the MCS servers.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The jstack output is at
> > > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The full collection of logs are at
> > > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > David
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > From: "Mihael Hategan"
> > > > > > > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 10:30:48 PM
> > > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > > > Could you also forward the attachments please?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Mihael
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael Wilde wrote:
> > > > > > > > > > > > > > > > > I think I am seeing a similar deadlock on 0.93 in the ParVis script, and am trying to get a clean log and jstack to confirm.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > As far as I can tell, Papia is running the correct 0.93 code, but please verify.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > David will try to replicate this problem as well.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > - Mike
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > > > From: "Papia Rizwan"
> > > > > > > > > > > > > > > > > > To: "swift-devel Devel", "Michael Wilde", "Michael P. Shields"
> > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM
> > > > > > > > > > > > > > > > > > Subject: swift 0.93 deadlock
> > > > > > > > > > > > > > > > > > Attached are the jstack output and the log file.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > Papia Rizwan
> > > > > > > > > > > > > > > > _______________________________________________
> > > > > > > > > > > > > > > > Swift-devel mailing list
> > > > > > > > > > > > > > > > Swift-devel at ci.uchicago.edu
> > > > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Michael Wilde
> > > > > > > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > > > > > > Argonne National Laboratory
> > > > > > > > > > > > --
> > > > > > > > > > > > Michael Wilde
> > > > > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > > > > Argonne National Laboratory
> > > > > > > > > > --
> > > > > > > > > > Michael Wilde
> > > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > > Argonne National Laboratory
> > > > > > > > --
> > > > > > > > Michael Wilde
> > > > > > > > Computation Institute, University of Chicago
> > > > > > > > Mathematics and Computer Science Division
> > > > > > > > Argonne National Laboratory
> > > > > > > _______________________________________________
> > > > > > > Swift-devel mailing list
> > > > > > > Swift-devel at ci.uchicago.edu
> > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From ketancmaheshwari at gmail.com Wed Sep 21 16:24:06 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Wed, 21 Sep 2011 16:24:06 -0500
Subject: [Swift-devel] persistent coasters and data staging
In-Reply-To: <1315873153.2945.0.camel@blabla>
References: <1315853779.19354.5.camel@blabla> <1315873153.2945.0.camel@blabla>
Message-ID:

Hi Mihael,

I tested this fix. It seems that the timeout issue for large-ish data and throttle > ~30 persists. I am not sure whether this is a data staging timeout, though.

The setup that fails is as follows:

persistent coasters, resource = workers running on OSG
data size = 8MB, 100 data items
foreach throttle = 40 = jobthrottle

The standard output intermittently shows some activity and then goes back to no activity, without any progress on tasks.

Please find the log and stdouterr here:
http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err
http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log

When I tested with small data (1MB, 2MB, 4MB), it did work. The 4MB run displayed fat-tail behavior, though: ~94 tasks completed steadily and quickly, while the last 5-6 tasks took disproportionately long. The throttle in these cases was <= 30.

Regards,
Ketan

On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan wrote:
> Try now please (cog r3262).
>
> On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote:
> > Mihael,
> >
> > I tried with the new worker.pl, running a 100 task 10MB per task run with throttle set at 100.
> >
> > However, it seems to have failed with the same symptoms of timeout error 521:
> >
> > Caused by: null
> > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
> > Progress: time: Mon, 12 Sep 2011 15:45:31 -0500 Submitted:53 Active:1 Failed:46
> > Progress: time: Mon, 12 Sep 2011 15:45:34 -0500 Submitted:53 Active:1 Failed:46
> > Exception in cat:
> > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt]
> > Host: grid
> > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk
> > - - -
> >
> > Caused by: null
> > Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 521
> > Progress: time: Mon, 12 Sep 2011 15:45:45 -0500 Submitted:52 Active:1 Failed:47
> > Exception in cat:
> > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt]
> > Host: grid
> > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk
> >
> > I had about 107 workers running at the time of these failures.
> >
> > I started seeing the failure messages after about 20 minutes into this run.
> >
> > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz
> >
> > Regards,
> > Ketan
> >
> > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan wrote:
> > On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:
> > > After some discussion with Mike, our conclusion from these runs was that the parallel data transfers are causing timeouts from worker.pl. Further, we were undecided whether the timeout threshold is set too aggressively, how the thresholds are determined, and whether a change in that value could resolve the issue.
> >
> > Something like that. Worker.pl would use the time when a file transfer started to determine timeouts. This is undesirable.
> > The purpose of timeouts is to determine whether the other side has stopped from properly following the flow of things. It follows that any kind of activity should reset the timeout... timer.
> >
> > I updated the worker code to deal with the issue in a proper way. But now I need your help. This is perl code, and it needs testing.
> >
> > So can you re-run, first with some simple test that uses coaster staging (just to make sure I didn't mess something up), and then the version of your tests that was most likely to fail?
> >
> > --
> > Ketan

--
Ketan

From davidk at ci.uchicago.edu Thu Sep 22 08:42:57 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Thu, 22 Sep 2011 08:42:57 -0500 (CDT)
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <1316588938.22874.0.camel@blabla>
Message-ID: <1025720302.115474.1316698977418.JavaMail.root@zimbra-mb2.anl.gov>

I tried re-running the script yesterday. It ran for approximately 20 hours before the coaster channel died and I saw a bunch of timeout errors in the logs. I'm not sure if this is a bug or if there was some type of network/service disruption. The worker scripts and coaster service are still running but no more work is being done. I did not run into any of the previous deadlock issues.

http://www.ci.uchicago.edu/~davidk/swat5/cce_ua-20110922-0411-x8yogrt8.log
http://www.ci.uchicago.edu/~davidk/swat5/coaster.log

David

----- Original Message -----
> From: "Mihael Hategan"
> To: "David Kelly"
> Cc: "swift-devel Devel", "Papia Rizwan", "Michael Wilde"
> Sent: Wednesday, September 21, 2011 2:08:58 AM
> Subject: Re: [Swift-devel] swift 0.93 deadlock
> Fix in r5143. Please test.
>
> On Mon, 2011-09-19 at 22:20 -0500, David Kelly wrote:
> > I tried today with the 0.93 update. It ran for approximately 7 hours before freezing.
> > It looks to be happening in a different place this time.
> >
> > http://www.ci.uchicago.edu/~davidk/swat4/jstack.log
> > http://www.ci.uchicago.edu/~davidk/swat4/cce_ua-20110919-1955-h7t8iui2.log
> >
> > David
> >
> > ----- Original Message -----
> > > From: "Mihael Hategan"
> > > To: "Michael Wilde"
> > > Cc: "David Kelly", "swift-devel Devel", "Papia Rizwan"
> > > Sent: Saturday, September 17, 2011 11:36:25 PM
> > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > I have a tentative fix in the branch and trunk. Revisions 5123 and 5124, respectively. Please let me know how that works out.
> > >
> > > Mihael
> > >
> > > On Fri, 2011-09-16 at 11:50 -0500, Michael Wilde wrote:
> > > > David and Papia, can you report to the list what the status is of running the SWAT app?
> > > >
> > > > - I understand that Mihael will work on the 0.93 deadlock fix this weekend, which is great.
> > > >
> > > > - I understand that it's happening on trunk as well
> > > >
> > > > - Papia, can you try to "perturb" the Swift code in the hopes that some equivalent but different code doesn't trip into the same bug? I.e., try a different mapper, different variable strategy (i.e., arrays vs scalars, structs vs separate vars) just to see if you can work around this? Or, put in some shell logic to catch the hang and kill and re-run (or resume) Swift? If you just kill a hung script and then resume it, will it work? We could maybe alter the hang checker to kill swift on its own, with a return code or message that you could use to trigger a resume.
> > > >
> > > > Mike
> > > >
> > > > ----- Original Message -----
> > > > > From: "David Kelly"
> > > > > To: "Michael Wilde"
> > > > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > > > Sent: Thursday, September 15, 2011 4:34:02 PM
> > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > I was able to get it running on PADS with trunk. I ran into the same issue.
> > > > >
> > > > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log
> > > > > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log
> > > > >
> > > > > David
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "David Kelly"
> > > > > > To: "Michael Wilde"
> > > > > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > > > > Sent: Thursday, September 15, 2011 2:39:47 PM
> > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using passive persistent coasters. Is there a way to use automatic coasters on the MCS workstations? I'll try copying this over to PADS and running there to see if I can reproduce it.
> > > > > >
> > > > > > David
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Michael Wilde"
> > > > > > > To: "David Kelly"
> > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > Sent: Thursday, September 15, 2011 2:18:17 PM
> > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > Can you make SWAT run under trunk, as Papia is testing using standard auto coasters, and doesn't need any of the missing coaster-service options?
> > > > > > >
> > > > > > > - Mike
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > > From: "David Kelly"
> > > > > > > > To: "Michael Wilde"
> > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > Sent: Thursday, September 15, 2011 2:15:36 PM
> > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > I got past the compilation errors by renaming all the functions with capitalization, but ran into an issue with coaster-service. Last week I noticed coaster-service was missing options for dynamic ports. I found today that it is also missing -passive. I'll try to track down where this changed and restore the previous version.
> > > > > > > >
> > > > > > > > David
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Michael Wilde"
> > > > > > > > > To: "David Kelly"
> > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM
> > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > Excellent, thanks - that's good. I also just verified that Papia is not using the overAllocation tags in the sites file, so this problem is clearly a Java deadlock and has nothing to do with the scheduling problem that the (now fixed) overAllocation problem was causing.
> > > > > > > > > > > > > > > > > > My understanding is that this SWAT script is failing > > > > > > > > > under > > > > > > > > > trunk > > > > > > > > > because of the recent token case handling issue (I > > > > > > > > > think > > > > > > > > > the > > > > > > > > > camel-case one). Can you work with Papia to see if > > > > > > > > > either > > > > > > > > > that > > > > > > > > > issue > > > > > > > > > is now fixed, or if her script can be changed to avoid > > > > > > > > > that, > > > > > > > > > so > > > > > > > > > that > > > > > > > > > you can both test the SWAT script with trunk, to see > > > > > > > > > if > > > > > > > > > the > > > > > > > > > deadlock > > > > > > > > > still occurs? > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > - MIke > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > From: "David Kelly" > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > , > > > > > > > > > > "Papia > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > I narrowed down the problem a bit. Last night I ran > > > > > > > > > > jstack > > > > > > > > > > on > > > > > > > > > > the > > > > > > > > > > wrong java process which is why it didn't report a > > > > > > > > > > deadlock. > > > > > > > > > > > > > > > > > > > > Papia and I are seeing the same issue. 
> > > > > > > > > > > > > > > > > > > > My jstack: > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/jstack.log > > > > > > > > > > Papia's jstack: > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log > > > > > > > > > > > > > > > > > > > > It happens in the same place: > > > > > > > > > > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100) > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24) > > > > > > > > > > > > > > > > > > > > Filed as bug #559 > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > From: "Michael Wilde" > > > > > > > > > > > To: "David Kelly" > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > , > > > > > > > > > > > "Papia > > > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > David, it sounds like more analysis is needed > > > > > > > > > > > here. If > > > > > > > > > > > the > > > > > > > > > > > SWAT > > > > > > > > > > > runs > > > > > > > > > > > are not showing a deadlock (but your runs are) > > > > > > > > > > > then > > > > > > > > > > > likely > > > > > > > > > > > we > > > > > > > > > > > have > > > > > > > > > > > two > > > > > > > > > > > different problems here. > > > > > > > > > > > > > > > > > > > > > > Another case we saw in 0.93 with scripts failing > > > > > > > > > > > to > > > > > > > > > > > progress > > > > > > > > > > > is > > > > > > > > > > > due > > > > > > > > > > > to > > > > > > > > > > > the overAllocation parameter problem that Mihael > > > > > > > > > > > fixed > > > > > > > > > > > yesterday. 
> > > > > > > > > > > The > > > > > > > > > > > symptom there is that Swift starts a coaster with > > > > > > > > > > > a > > > > > > > > > > > time > > > > > > > > > > > slot > > > > > > > > > > > too > > > > > > > > > > > small for the apps in the script, and no apps wind > > > > > > > > > > > up > > > > > > > > > > > running. > > > > > > > > > > > I > > > > > > > > > > > think > > > > > > > > > > > that situation in general merits a separate > > > > > > > > > > > ticket, > > > > > > > > > > > and > > > > > > > > > > > may > > > > > > > > > > > have > > > > > > > > > > > been > > > > > > > > > > > discussed on swift-devel (but quite a while ago). > > > > > > > > > > > > > > > > > > > > > > Can you determine if indeed Papia's SWAT runs are > > > > > > > > > > > hanging > > > > > > > > > > > for > > > > > > > > > > > a > > > > > > > > > > > reason > > > > > > > > > > > other than a Java deadlock? > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > , > > > > > > > > > > > > "Papia > > > > > > > > > > > > Rizwan" , "Mihael > > > > > > > > > > > > Hategan" > > > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > The jstack log corresponds to the most recent > > > > > > > > > > > > log > > > > > > > > > > > > file - > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. > > > > > > > > > > > > jstack does not report any deadlocks, but I > > > > > > > > > > > > thought > > > > > > > > > > > > it > > > > > > > > > > > > might > > > > > > > > > > > > be > > > > > > > > > > > > useful > > > > > > > > > > > > so I included it. 
Swift was not making any > > > > > > > > > > > > progress > > > > > > > > > > > > for > > > > > > > > > > > > about > > > > > > > > > > > > 5 > > > > > > > > > > > > hours > > > > > > > > > > > > before I sent the logs. I am running the latest > > > > > > > > > > > > 0.93 > > > > > > > > > > > > branch. > > > > > > > > > > > > I > > > > > > > > > > > > will > > > > > > > > > > > > try again today. > > > > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > From: "Michael Wilde" > > > > > > > > > > > > > To: "David Kelly" > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > , > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > Rizwan" , "Mihael > > > > > > > > > > > > > Hategan" > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > > David, which of the many Swift logs in that > > > > > > > > > > > > > /swat > > > > > > > > > > > > > dir > > > > > > > > > > > > > does > > > > > > > > > > > > > the > > > > > > > > > > > > > jstack.log pertain to? How many of these runs > > > > > > > > > > > > > deadlocked? > > > > > > > > > > > > > > > > > > > > > > > > > > And, did you verify that you (and Papia) are > > > > > > > > > > > > > running > > > > > > > > > > > > > on > > > > > > > > > > > > > the > > > > > > > > > > > > > latest > > > > > > > > > > > > > rev > > > > > > > > > > > > > of the 0.93 branch? 
> > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > > > > To: "Mihael Hategan" > > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > > , > > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > > Rizwan" , "Michael > > > > > > > > > > > > > > Wilde" > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 > > > > > > > > > > > > > > PM > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 > > > > > > > > > > > > > > deadlock > > > > > > > > > > > > > > I was able to reproduce the problem with > > > > > > > > > > > > > > persistent > > > > > > > > > > > > > > coasters > > > > > > > > > > > > > > on > > > > > > > > > > > > > > the > > > > > > > > > > > > > > MCS servers. > > > > > > > > > > > > > > > > > > > > > > > > > > > > The jstack output is at > > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log > > > > > > > > > > > > > > > > > > > > > > > > > > > > The full collection of logs are at > > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > > From: "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > > > , > > > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > > > Rizwan" > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 > > > > > > > > > > > > > > > 10:30:48 > > > > > > > > > > > > > > > PM > > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 > > > > > > > > > > > > > > > deadlock > > > > > > > > > > > > > > > Could you also forward the attachments > > > > > > > > > > > > > > > please? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael > > > > > > > > > > > > > > > Wilde > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > I think I am seeing a similar deadlock > > > > > > > > > > > > > > > > on > > > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > > > in > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > ParVis > > > > > > > > > > > > > > > > script, > > > > > > > > > > > > > > > > and am trying to get a clean log and > > > > > > > > > > > > > > > > jstack > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > confirm. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > As far as I can tell, Papia is running > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > correct > > > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > > > code, > > > > > > > > > > > > > > > > but > > > > > > > > > > > > > > > > please verify. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > David will try to replicate this problem > > > > > > > > > > > > > > > > as > > > > > > > > > > > > > > > > well. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > > > > From: "Papia Rizwan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "swift-devel Devel" > > > > > > > > > > > > > > > > > , > > > > > > > > > > > > > > > > > "Michael > > > > > > > > > > > > > > > > > Wilde" , "Michael > > > > > > > > > > > > > > > > > P. > > > > > > > > > > > > > > > > > Shields" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 > > > > > > > > > > > > > > > > > 1:56:13 PM > > > > > > > > > > > > > > > > > Subject: swift 0.93 deadlock > > > > > > > > > > > > > > > > > Attached are the jstack output and the > > > > > > > > > > > > > > > > > log > > > > > > > > > > > > > > > > > file. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > Papia Rizwan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > > > Swift-devel mailing list > > > > > > > > > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Michael Wilde > > > > > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > Michael Wilde > > > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Michael Wilde > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > -- > > > > > > > Michael Wilde > > > > > > > Computation Institute, University of Chicago > > > > > > > Mathematics and Computer Science Division > > > > > > > Argonne National Laboratory > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > From wilde at mcs.anl.gov Thu Sep 22 09:56:22 2011 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 22 Sep 2011 09:56:22 -0500 (CDT) Subject: [Swift-devel] Notes for log processing In-Reply-To: <1725004319.28826.1316696863610.JavaMail.root@zimbra.anl.gov> 
Message-ID: <2054923131.29369.1316703382158.JavaMail.root@zimbra.anl.gov>

I wanted to do some basic plots to better understand the performance and bottlenecks of a Swift script. I started with this doc on log processing:

http://www.ci.uchicago.edu/swift/wwwdev/guides/trunk/userguide/userguide.html#_log_processing

Here's what I had to do to generate a load plot of my Swift run. Hopefully this will help in documenting how to use the new Java plotting configs now, as well as packaging them for easy use with higher level scripts.

- Mike

# Point to the log processing tools

lp=/homes/wilde/swift/src/0.93/cog/modules/swift/dist/swift-svn/libexec/log-processing

# fix normalize-log.pl: FH -> START

# find the start time of the log in decimal unixtime

$ $lp/normalize-log.pl stime amwg_stats-20110922-0541-ii4a5z96.iso >amwg_stats-20110922-0541-ii4a5z96.norm

# convert ISO time to UNIX time in decimal seconds

$ $lp/iso-to-secs amwg_stats-20110922-0541-ii4a5z96.iso

# Normalize the log to start at time 0.0

$ $lp/normalize-log.pl stime amwg_stats-20110922-0541-ii4a5z96.iso >amwg_stats-20110922-0541-ii4a5z96.norm

This gives, eg:

0 DEBUG Loader arguments: [-config, cf.properties, ...
0.00199985504150391 DEBUG Loader Max heap: 257294336
0.0199999809265137 DEBUG textfiles BEGIN CDM FILE:
0.0199999809265137 DEBUG textfiles END CDM FILE:
0.622999906539917 DEBUG textfiles BEGIN SWIFTSCRIPT:
0.623999834060669 DEBUG textfiles END SWIFTSCRIPT:

# Install the new plotter tools

$ svn co https://svn.ci.uchicago.edu/svn/vdl2/usertools/plotter
U plotter
Checked out revision 5151.
fusion$ cd plotter
fusion$ ant
Buildfile: build.xml

compile:
[javac] Compiling 7 source files to /fusion/gpfs/home/wilde/amwg/run01/plotter/build

jar:
[jar] Building jar: /fusion/gpfs/home/wilde/amwg/run01/plotter/lib/plotter.jar

BUILD SUCCESSFUL
Total time: 2 seconds

# Edit load.cfg (or change your plot files to match it)

# cp $lp/load.cfg .
# then edit to contain:

xlabel = time
ylabel = load

shape.amwg_stats-20110922-0541-ii4a5z96.load.data = none
label.amwg_stats-20110922-0541-ii4a5z96.load.data = load

# (was eg: label.load.data = load)

# Generate a Load plot:

$ ./plotter/swift_plotter.zsh -s ./load.cfg load.eps *96.load.data

From wozniak at mcs.anl.gov Thu Sep 22 11:05:25 2011
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Thu, 22 Sep 2011 11:05:25 -0500 (Central Daylight Time)
Subject: [Swift-devel] Notes for log processing
In-Reply-To: <2054923131.29369.1316703382158.JavaMail.root@zimbra.anl.gov>
References: <2054923131.29369.1316703382158.JavaMail.root@zimbra.anl.gov>
Message-ID:

An asciidoc-rendered version of the plotter README is up at:

http://www.mcs.anl.gov/~wozniak/plotter-readme.html

On Thu, 22 Sep 2011, Michael Wilde wrote:

> I wanted to do some basic plots to better understand the performance and bottlenecks of a Swift script. I started with this doc on log processing:
>
> http://www.ci.uchicago.edu/swift/wwwdev/guides/trunk/userguide/userguide.html#_log_processing
>
> Here's what I had to do to generate a load plot of my Swift run. Hopefully this will help in documenting how to use the new Java plotting configs now, as well as packaging them for easy use with higher level scripts.
>
> - Mike
>
>
> # Point to the log processing tools
>
> lp=/homes/wilde/swift/src/0.93/cog/modules/swift/dist/swift-svn/libexec/log-processing
>
> # fix normalize-log.pl: FH -> START
>
> # find the start time of the log in decimal unixtime
>
> $ $lp/normalize-log.pl stime amwg_stats-20110922-0541-ii4a5z96.iso >amwg_stats-20110922-0541-ii4a5z96.norm
>
> # convert ISO time to UNIX time in decimal seconds
>
> $ $lp/iso-to-secs amwg_stats-20110922-0541-ii4a5z96.iso
>
> # Normalize the log to start at time 0.0
>
> $ $lp/normalize-log.pl stime amwg_stats-20110922-0541-ii4a5z96.iso >amwg_stats-20110922-0541-ii4a5z96.norm
>
> This gives, eg:
>
> 0 DEBUG Loader arguments: [-config, cf.properties, ...
> 0.00199985504150391 DEBUG Loader Max heap: 257294336
> 0.0199999809265137 DEBUG textfiles BEGIN CDM FILE:
> 0.0199999809265137 DEBUG textfiles END CDM FILE:
> 0.622999906539917 DEBUG textfiles BEGIN SWIFTSCRIPT:
> 0.623999834060669 DEBUG textfiles END SWIFTSCRIPT:
>
> # Install the new plotter tools
>
> $ svn co https://svn.ci.uchicago.edu/svn/vdl2/usertools/plotter
> U plotter
> Checked out revision 5151.
> fusion$ cd plotter
> fusion$ ant
> Buildfile: build.xml
>
> compile:
> [javac] Compiling 7 source files to /fusion/gpfs/home/wilde/amwg/run01/plotter/build
>
> jar:
> [jar] Building jar: /fusion/gpfs/home/wilde/amwg/run01/plotter/lib/plotter.jar
>
> BUILD SUCCESSFUL
> Total time: 2 seconds
>
>
> # Edit load.cfg (or change your plot files to match it)
>
> # cp $lp/load.cfg . # then edit to contain:
>
> xlabel = time
> ylabel = load
>
> shape.amwg_stats-20110922-0541-ii4a5z96.load.data = none
> label.amwg_stats-20110922-0541-ii4a5z96.load.data = load
>
> # (was eg: label.load.data = load)
>
> # Generate a Load plot:
>
> $ ./plotter/swift_plotter.zsh -s ./load.cfg load.eps *96.load.data
>

--
Justin M Wozniak

From hategan at mcs.anl.gov Thu Sep 22 13:57:05 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 22 Sep 2011 11:57:05 -0700
Subject: [Swift-devel] persistent coasters and data staging
In-Reply-To:
References: <1315853779.19354.5.camel@blabla> <1315873153.2945.0.camel@blabla>
Message-ID: <1316717825.6012.3.camel@blabla>

What I see in the log is the error about the invalid path, which, as I mentioned before, is an issue of var_str seemingly being empty (you may want to trace its value though to confirm). I don't see anything about a stagein/out issue.

Mihael

On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote:
> Hi Mihael,
>
>
> I tested this fix. It seems that the timeout issue for large-ish data
> and throttle > ~30 persists. I am not sure if this is data staging
> timeout though.
> > > The setup that fails is as follows: > > > persistent coasters, resource= workers running on OSG > data size=8MB, 100 data items. > foreach throttle=40=jobthrottle. > > > The standard output seems intermittently showing some activity and > then getting back to no activity without any progress on tasks. > > > Please find the log and stdouterr > here: http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err, > http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log > > > When I tested with small data, 1MB, 2MB, 4MB, it did work. 4MB > displayed a fat tail behavior though, ~94 tasks completing steadily > and quickly while the last 5-6 tasks taking disproportionate times. > The throttle in these cases was <= 30. > > > > > Regards, > Ketan > > On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan > wrote: > Try now please (cog r3262). > > On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote: > > > > Mihael, > > > > > > I tried with the new worker.pl, running a 100 task 10MB per > task run > > with throttle set at 100. 
> > > > > > However, it seems to have failed with the same symptoms of > timeout > > error 521: > > > > > > Caused by: null > > Caused by: > > > org.globus.cog.abstraction.impl.common.execution.JobException: > Job > > failed with an exit code of 521 > > Progress: time: Mon, 12 Sep 2011 15:45:31 -0500 > Submitted:53 > > Active:1 Failed:46 > > Progress: time: Mon, 12 Sep 2011 15:45:34 -0500 > Submitted:53 > > Active:1 Failed:46 > > Exception in cat: > > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt] > > Host: grid > > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk > > - - - > > > > > > Caused by: null > > Caused by: > > > org.globus.cog.abstraction.impl.common.execution.JobException: > Job > > failed with an exit code of 521 > > Progress: time: Mon, 12 Sep 2011 15:45:45 -0500 > Submitted:52 > > Active:1 Failed:47 > > Exception in cat: > > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt] > > Host: grid > > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk > > > > > > I had about 107 workers running at the time of these > failures. > > > > > > I started seeing the failure messages after about 20 minutes > into this > > run. > > > > > > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz > > > > > > Regards, > > Ketan > > > > > > > > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan > > > wrote: > > On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari > wrote: > > > > > After some discussion with Mike, Our conclusion > from these > > runs was > > > that the parallel data transfers are causing > timeouts from > > the > > > worker.pl, further, we were undecided if somehow > the timeout > > threshold > > > is set too agressive plus how are they determined > and > > whether a change > > > in that value could resolve the issue. > > > > > > Something like that. Worker.pl would use the time > when a file > > transfer > > started to determine timeouts. This is undesirable. 
> > The purpose of timeouts is to determine whether the other side has
> > stopped from properly following the flow of things. It follows that any
> > kind of activity should reset the timeout... timer.
> >
> > I updated the worker code to deal with the issue in a proper way. But
> > now I need your help. This is perl code, and it needs testing.
> >
> > So can you re-run, first with some simple test that uses coaster staging
> > (just to make sure I didn't mess something up), and then the version of
> > your tests that was most likely to fail?
> >
> > --
> > Ketan
>
> --
> Ketan

From hategan at mcs.anl.gov Thu Sep 22 14:03:23 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 22 Sep 2011 12:03:23 -0700
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <1025720302.115474.1316698977418.JavaMail.root@zimbra-mb2.anl.gov>
References: <1025720302.115474.1316698977418.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1316718203.6012.6.camel@blabla>

I'm not sure what's going on there. Basically the very first command that is run after a channel is established in order to configure various parameters is timing out. I've never seen this before.

Maybe having debug statements from the service might help. You'll need to figure out what log4j.properties the service is using and hack that to display debug messages for:
org.globus.cog.karajan.workflow.service
and
org.globus.cog.abstraction.impl.execution.coaster

Let me know if you have trouble with that.

Mihael

On Thu, 2011-09-22 at 08:42 -0500, David Kelly wrote:
> I tried re-running the script yesterday. It ran for approximately 20 hours before the coaster channel died and I saw a bunch of timeout errors in the logs. I'm not sure if this is a bug or if there was some type of network/service disruption. The worker scripts and coaster service are still running but no more work is being done.
> I did not run into any of the previous deadlock issues.
>
> http://www.ci.uchicago.edu/~davidk/swat5/cce_ua-20110922-0411-x8yogrt8.log
> http://www.ci.uchicago.edu/~davidk/swat5/coaster.log
>
> David
>
> ----- Original Message -----
> > From: "Mihael Hategan"
> > To: "David Kelly"
> > Cc: "swift-devel Devel", "Papia Rizwan", "Michael Wilde"
> > Sent: Wednesday, September 21, 2011 2:08:58 AM
> > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > Fix in r5143. Please test.
> >
> > On Mon, 2011-09-19 at 22:20 -0500, David Kelly wrote:
> > > I tried today with the 0.93 update. It ran for approximately 7 hours
> > > before freezing. It looks to be happening in a different place this
> > > time.
> > >
> > > http://www.ci.uchicago.edu/~davidk/swat4/jstack.log
> > > http://www.ci.uchicago.edu/~davidk/swat4/cce_ua-20110919-1955-h7t8iui2.log
> > >
> > > David
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Mihael Hategan"
> > > > To: "Michael Wilde"
> > > > Cc: "David Kelly", "swift-devel Devel", "Papia Rizwan"
> > > > Sent: Saturday, September 17, 2011 11:36:25 PM
> > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > I have a tentative fix in the branch and trunk. Revisions 5123 and
> > > > 5124, respectively. Please let me know how that works out.
> > > >
> > > > Mihael
> > > >
> > > > On Fri, 2011-09-16 at 11:50 -0500, Michael Wilde wrote:
> > > > > David and Papia, can you report to the list what the status is
> > > > > of running the SWAT app?
> > > > >
> > > > > - I understand that Mihael will work on the 0.93 deadlock fix
> > > > > this weekend, which is great.
> > > > >
> > > > > - I understand that it's happening on trunk as well
> > > > >
> > > > > - Papia, can you try to "perturb" the Swift code in the hopes
> > > > > that some equivalent but different code doesn't trip into the same
> > > > > bug?
Ie > > > > > try a different mapper, different variable strategy (ie arrays > > > > > vs > > > > > scalars, structs vs separate vars) just to see if you can work > > > > > around this? Or, put in some shell logic to catch the hang and > > > > > kill > > > > > and re-run (or resume) Swift? if you just kill a hung script and > > > > > then resume it, will it work? We could maybe alter the hang > > > > > checker > > > > > to kill swift on its own, with a return code or message that you > > > > > could use to trigger a resume. > > > > > > > > > > Mike > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > From: "David Kelly" > > > > > > To: "Michael Wilde" > > > > > > Cc: "swift-devel Devel" , "Papia > > > > > > Rizwan" > > > > > > Sent: Thursday, September 15, 2011 4:34:02 PM > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > I was able to get it running on PADS with trunk. I ran into > > > > > > the > > > > > > same > > > > > > issue. > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log > > > > > > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log > > > > > > > > > > > > David > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "David Kelly" > > > > > > > To: "Michael Wilde" > > > > > > > Cc: "swift-devel Devel" , > > > > > > > "Papia > > > > > > > Rizwan" > > > > > > > Sent: Thursday, September 15, 2011 2:39:47 PM > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using > > > > > > > passive > > > > > > > persistent coasters. Is there a way to use automatic > > > > > > > coasters on > > > > > > > the > > > > > > > MCS workstations? I'll try copying this over to PADS and > > > > > > > running > > > > > > > there > > > > > > > to see if I can reproduce it. 
> > > > > > > > > > > > > > David > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > From: "Michael Wilde" > > > > > > > > To: "David Kelly" > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > "Papia > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 2:18:17 PM > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > Can you make SWAT run under trunk, as Papia is testing > > > > > > > > using > > > > > > > > standard > > > > > > > > auto coasters, and doesnt need any of the missing > > > > > > > > coaster-service > > > > > > > > options. > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > From: "David Kelly" > > > > > > > > > To: "Michael Wilde" > > > > > > > > > Cc: "swift-devel Devel" , > > > > > > > > > "Papia > > > > > > > > > Rizwan" , "Mihael Hategan" > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 2:15:36 PM > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > I got past the compilation errors by renaming the all > > > > > > > > > functions > > > > > > > > > with > > > > > > > > > capitalization, but ran into an issue with > > > > > > > > > coaster-service. > > > > > > > > > Last > > > > > > > > > week > > > > > > > > > I noticed coaster-service was missing options for > > > > > > > > > dynamic > > > > > > > > > ports. > > > > > > > > > I > > > > > > > > > found today that it is also missing -passive. I'll try > > > > > > > > > to > > > > > > > > > track > > > > > > > > > down > > > > > > > > > where this changed and restore the previous version. 
> > > > > > > > >
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > To: "David Kelly"
> > > > > > > > > > Cc: "swift-devel Devel" , "Papia Rizwan" , "Mihael Hategan"
> > > > > > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM
> > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > Excellent, thanks - that's good. I also just verified that Papia is not using the overAllocation tags in the sites file, so this problem is clearly a Java deadlock and has nothing to do with the scheduling problem that the (now fixed) overAllocation problem was causing.
> > > > > > > > > >
> > > > > > > > > > My understanding is that this SWAT script is failing under trunk because of the recent token case handling issue (I think the camel-case one). Can you work with Papia to see if either that issue is now fixed, or if her script can be changed to avoid that, so that you can both test the SWAT script with trunk, to see if the deadlock still occurs?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > - Mike
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > Cc: "swift-devel Devel" , "Papia Rizwan" , "Mihael Hategan"
> > > > > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM
> > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > I narrowed down the problem a bit. Last night I ran jstack on the wrong Java process, which is why it didn't report a deadlock.
> > > > > > > > > > >
> > > > > > > > > > > Papia and I are seeing the same issue.
> > > > > > > > > > >
> > > > > > > > > > > My jstack: http://www.ci.uchicago.edu/~davidk/swat2/jstack.log
> > > > > > > > > > > Papia's jstack: http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log
> > > > > > > > > > >
> > > > > > > > > > > It happens in the same place:
> > > > > > > > > > >
> > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100)
> > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24)
> > > > > > > > > > >
> > > > > > > > > > > Filed as bug #559
> > > > > > > > > > >
> > > > > > > > > > > David
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > > > To: "David Kelly"
> > > > > > > > > > > > Cc: "swift-devel Devel" , "Papia Rizwan" , "Mihael Hategan"
> > > > > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM
> > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > David, it sounds like more analysis is needed here. If the SWAT runs are not showing a deadlock (but your runs are) then likely we have two different problems here.
> > > > > > > > > > > >
> > > > > > > > > > > > Another case we saw in 0.93 with scripts failing to progress is due to the overAllocation parameter problem that Mihael fixed yesterday. The symptom there is that Swift starts a coaster with a time slot too small for the apps in the script, and no apps wind up running. I think that situation in general merits a separate ticket, and may have been discussed on swift-devel (but quite a while ago).
> > > > > > > > > > > >
> > > > > > > > > > > > Can you determine if indeed Papia's SWAT runs are hanging for a reason other than a Java deadlock?
> > > > > > > > > > > >
> > > > > > > > > > > > - Mike
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > > > Cc: "swift-devel Devel" , "Papia Rizwan" , "Mihael Hategan"
> > > > > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM
> > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > The jstack log corresponds to the most recent log file - http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. jstack does not report any deadlocks, but I thought it might be useful so I included it. Swift was not making any progress for about 5 hours before I sent the logs. I am running the latest 0.93 branch. I will try again today.
> > > > > > > > > > > > >
> > > > > > > > > > > > > David
> > > > > > > > > > > > >
> > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > > > > > To: "David Kelly"
> > > > > > > > > > > > > > Cc: "swift-devel Devel" , "Papia Rizwan" , "Mihael Hategan"
> > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM
> > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > David, which of the many Swift logs in that /swat dir does the jstack.log pertain to? How many of these runs deadlocked?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > And, did you verify that you (and Papia) are running on the latest rev of the 0.93 branch?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - Mike
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > > > > > To: "Mihael Hategan"
> > > > > > > > > > > > > > > Cc: "swift-devel Devel" , "Papia Rizwan" , "Michael Wilde"
> > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 PM
> > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > > I was able to reproduce the problem with persistent coasters on the MCS servers.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The jstack output is at http://www.ci.uchicago.edu/~davidk/swat/jstack.log
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The full collection of logs is at http://www.ci.uchicago.edu/~davidk/swat.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > David
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > From: "Mihael Hategan"
> > > > > > > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > > > > > > Cc: "swift-devel Devel" , "Papia Rizwan"
> > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 10:30:48 PM
> > > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > > > Could you also forward the attachments please?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Mihael
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael Wilde wrote:
> > > > > > > > > > > > > > > > > I think I am seeing a similar deadlock on 0.93 in the ParVis script, and am trying to get a clean log and jstack to confirm.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > As far as I can tell, Papia is running the correct 0.93 code, but please verify.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > David will try to replicate this problem as well.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > - Mike
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > > > From: "Papia Rizwan"
> > > > > > > > > > > > > > > > > > To: "swift-devel Devel" , "Michael Wilde" , "Michael P. Shields"
> > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM
> > > > > > > > > > > > > > > > > > Subject: swift 0.93 deadlock
> > > > > > > > > > > > > > > > > > Attached are the jstack output and the log file.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > Papia Rizwan
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > _______________________________________________
> > > > > > > > > > > > > > > > Swift-devel mailing list
> > > > > > > > > > > > > > > > Swift-devel at ci.uchicago.edu
> > > > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Michael Wilde
> > > > > > > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > > > > > > Argonne National Laboratory
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Michael Wilde
> > > > > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > > > > Argonne National Laboratory
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Michael Wilde
> > > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > > Argonne National Laboratory
> > > > > > > >
> > > > > > > > --
> > > > > > > > Michael Wilde
> > > > > > > > Computation Institute, University of Chicago
> > > > > > > > Mathematics and Computer Science Division
> > > > > > > > Argonne National Laboratory
> > > > > > > _______________________________________________
> > > > > > > Swift-devel mailing list
> > > > > > > Swift-devel at ci.uchicago.edu
> > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > > >

From ketancmaheshwari at gmail.com Thu Sep 22 14:07:53 2011
From: ketancmaheshwari at gmail.com (Ketan Maheshwari)
Date: Thu, 22 Sep 2011 14:07:53 -0500
Subject: [Swift-devel]
persistent coasters and data staging In-Reply-To: <1316717825.6012.3.camel@blabla> References: <1315853779.19354.5.camel@blabla> <1315873153.2945.0.camel@blabla> <1316717825.6012.3.camel@blabla> Message-ID: Mihael, The experiments and logs I sent you above are not from the SCEC workflow. These are just the catsn scripts. The logs also doesn't show anything related to invalid path as such. The var_str invalid path issue still persists though and I am trying to debug it, but that is a completely different one. Regards, Ketan On Thu, Sep 22, 2011 at 1:57 PM, Mihael Hategan wrote: > What I see in the log is the error about the invalid path, which, as I > mentioned before, is an issue of var_str seemingly being empty (you may > want to trace its value though to confirm). I don't see anything about a > stagein/out issue. > > Mihael > > On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote: > > Hi Mihael, > > > > > > I tested this fix. It seems that the timeout issue for large-ish data > > and throttle > ~30 persists. I am not sure if this is data staging > > timeout though. > > > > > > The setup that fails is as follows: > > > > > > persistent coasters, resource= workers running on OSG > > data size=8MB, 100 data items. > > foreach throttle=40=jobthrottle. > > > > > > The standard output seems intermittently showing some activity and > > then getting back to no activity without any progress on tasks. > > > > > > Please find the log and stdouterr > > here: http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err, > > > http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log > > > > > > When I tested with small data, 1MB, 2MB, 4MB, it did work. 4MB > > displayed a fat tail behavior though, ~94 tasks completing steadily > > and quickly while the last 5-6 tasks taking disproportionate times. > > The throttle in these cases was <= 30. 
> > > > > > > > > > Regards, > > Ketan > > > > On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan > > wrote: > > Try now please (cog r3262). > > > > On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari wrote: > > > > > > > Mihael, > > > > > > > > > I tried with the new worker.pl, running a 100 task 10MB per > > task run > > > with throttle set at 100. > > > > > > > > > However, it seems to have failed with the same symptoms of > > timeout > > > error 521: > > > > > > > > > Caused by: null > > > Caused by: > > > > > org.globus.cog.abstraction.impl.common.execution.JobException: > > Job > > > failed with an exit code of 521 > > > Progress: time: Mon, 12 Sep 2011 15:45:31 -0500 > > Submitted:53 > > > Active:1 Failed:46 > > > Progress: time: Mon, 12 Sep 2011 15:45:34 -0500 > > Submitted:53 > > > Active:1 Failed:46 > > > Exception in cat: > > > Arguments: [gpfs/pads/swift/ketan/indir10/data0002.txt] > > > Host: grid > > > Directory: catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk > > > - - - > > > > > > > > > Caused by: null > > > Caused by: > > > > > org.globus.cog.abstraction.impl.common.execution.JobException: > > Job > > > failed with an exit code of 521 > > > Progress: time: Mon, 12 Sep 2011 15:45:45 -0500 > > Submitted:52 > > > Active:1 Failed:47 > > > Exception in cat: > > > Arguments: [gpfs/pads/swift/ketan/indir10/data0014.txt] > > > Host: grid > > > Directory: catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk > > > > > > > > > I had about 107 workers running at the time of these > > failures. > > > > > > > > > I started seeing the failure messages after about 20 minutes > > into this > > > run. 
> > > > > > > > > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz > > > > > > > > > Regards, > > > Ketan > > > > > > > > > > > > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan > > > > > wrote: > > > On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari > > wrote: > > > > > > > After some discussion with Mike, Our conclusion > > from these > > > runs was > > > > that the parallel data transfers are causing > > timeouts from > > > the > > > > worker.pl, further, we were undecided if somehow > > the timeout > > > threshold > > > > is set too agressive plus how are they determined > > and > > > whether a change > > > > in that value could resolve the issue. > > > > > > > > > Something like that. Worker.pl would use the time > > when a file > > > transfer > > > started to determine timeouts. This is undesirable. > > The > > > purpose of > > > timeouts is to determine whether the other side has > > stopped > > > from > > > properly following the flow of things. It follows > > that any > > > kind of > > > activity should reset the timeout... timer. > > > > > > I updated the worker code to deal with the issue in > > a proper > > > way. But > > > now I need your help. This is perl code, and it > > needs testing. > > > > > > So can you re-run, first with some simple test that > > uses > > > coaster staging > > > (just to make sure I didn't mess something up), and > > then the > > > version of > > > your tests that was most likely to fail? > > > > > > > > > > > > > > > > > > -- > > > Ketan > > > > > > > > > > > > > > > > > > > > > > > > > -- > > Ketan > > > > > > > > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Thu Sep 22 14:12:10 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 22 Sep 2011 12:12:10 -0700 Subject: [Swift-devel] persistent coasters and data staging In-Reply-To: References: <1315853779.19354.5.camel@blabla> <1315873153.2945.0.camel@blabla> <1316717825.6012.3.camel@blabla> Message-ID: <1316718730.6443.0.camel@blabla> Ah, yes. Sorry. I was looking at the wrong log. On Thu, 2011-09-22 at 14:07 -0500, Ketan Maheshwari wrote: > Mihael, > > > The experiments and logs I sent you above are not from the SCEC > workflow. These are just the catsn scripts. The logs also doesn't show > anything related to invalid path as such. > > > The var_str invalid path issue still persists though and I am trying > to debug it, but that is a completely different one. > > > Regards, > Ketan > > > On Thu, Sep 22, 2011 at 1:57 PM, Mihael Hategan > wrote: > What I see in the log is the error about the invalid path, > which, as I > mentioned before, is an issue of var_str seemingly being > empty (you may > want to trace its value though to confirm). I don't see > anything about a > stagein/out issue. > > Mihael > > > On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote: > > Hi Mihael, > > > > > > I tested this fix. It seems that the timeout issue for > large-ish data > > and throttle > ~30 persists. I am not sure if this is data > staging > > timeout though. > > > > > > The setup that fails is as follows: > > > > > > persistent coasters, resource= workers running on OSG > > data size=8MB, 100 data items. > > foreach throttle=40=jobthrottle. > > > > > > The standard output seems intermittently showing some > activity and > > then getting back to no activity without any progress on > tasks. 
> > > > > > Please find the log and stdouterr > > here: > http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err, > > > http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log > > > > > > When I tested with small data, 1MB, 2MB, 4MB, it did work. > 4MB > > displayed a fat tail behavior though, ~94 tasks completing > steadily > > and quickly while the last 5-6 tasks taking disproportionate > times. > > The throttle in these cases was <= 30. > > > > > > > > > > Regards, > > Ketan > > > > On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan > > > wrote: > > Try now please (cog r3262). > > > > On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari > wrote: > > > > > > > Mihael, > > > > > > > > > I tried with the new worker.pl, running a 100 task > 10MB per > > task run > > > with throttle set at 100. > > > > > > > > > However, it seems to have failed with the same > symptoms of > > timeout > > > error 521: > > > > > > > > > Caused by: null > > > Caused by: > > > > > > org.globus.cog.abstraction.impl.common.execution.JobException: > > Job > > > failed with an exit code of 521 > > > Progress: time: Mon, 12 Sep 2011 15:45:31 -0500 > > Submitted:53 > > > Active:1 Failed:46 > > > Progress: time: Mon, 12 Sep 2011 15:45:34 -0500 > > Submitted:53 > > > Active:1 Failed:46 > > > Exception in cat: > > > Arguments: > [gpfs/pads/swift/ketan/indir10/data0002.txt] > > > Host: grid > > > Directory: > catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk > > > - - - > > > > > > > > > Caused by: null > > > Caused by: > > > > > > org.globus.cog.abstraction.impl.common.execution.JobException: > > Job > > > failed with an exit code of 521 > > > Progress: time: Mon, 12 Sep 2011 15:45:45 -0500 > > Submitted:52 > > > Active:1 Failed:47 > > > Exception in cat: > > > Arguments: > [gpfs/pads/swift/ketan/indir10/data0014.txt] > > > Host: grid > > > Directory: > catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk > > > > > > > > > I had about 107 workers running at the time of > these > 
> failures. > > > > > > > > > I started seeing the failure messages after about > 20 minutes > > into this > > > run. > > > > > > > > > The logs are in > http://www.ci.uchicago.edu/~ketan/pack.tgz > > > > > > > > > Regards, > > > Ketan > > > > > > > > > > > > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan > > > > > wrote: > > > On Mon, 2011-09-12 at 11:58 -0500, Ketan > Maheshwari > > wrote: > > > > > > > After some discussion with Mike, Our > conclusion > > from these > > > runs was > > > > that the parallel data transfers are > causing > > timeouts from > > > the > > > > worker.pl, further, we were undecided if > somehow > > the timeout > > > threshold > > > > is set too agressive plus how are they > determined > > and > > > whether a change > > > > in that value could resolve the issue. > > > > > > > > > Something like that. Worker.pl would use > the time > > when a file > > > transfer > > > started to determine timeouts. This is > undesirable. > > The > > > purpose of > > > timeouts is to determine whether the other > side has > > stopped > > > from > > > properly following the flow of things. It > follows > > that any > > > kind of > > > activity should reset the timeout... > timer. > > > > > > I updated the worker code to deal with the > issue in > > a proper > > > way. But > > > now I need your help. This is perl code, > and it > > needs testing. > > > > > > So can you re-run, first with some simple > test that > > uses > > > coaster staging > > > (just to make sure I didn't mess something > up), and > > then the > > > version of > > > your tests that was most likely to fail? 
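
A minimal sketch of the timeout rule described above, in which any activity resets the timer instead of the clock running from the start of the file transfer. This is not the worker.pl change itself (that code is Perl and is not shown in this thread); the class and function names are hypothetical, and Python is used only for illustration.

```python
import time


class IdleTimeout:
    """Expires only after `limit` seconds with no activity at all."""

    def __init__(self, limit, clock=time.monotonic):
        self.limit = limit
        self.clock = clock  # injectable so the logic can be exercised without sleeping
        self.last_activity = clock()

    def touch(self):
        # Call on every sign of life: a data chunk, an acknowledgement, a heartbeat.
        self.last_activity = self.clock()

    def expired(self):
        return self.clock() - self.last_activity > self.limit


def start_based_expired(started, now, limit):
    # The behaviour being replaced: elapsed time measured from the transfer start,
    # so a long but healthy transfer is eventually reported as a timeout.
    return now - started > limit
```

With a 60-second limit and a chunk arriving every 30 seconds, the idle timer would never fire during a 300-second transfer, whereas the start-based check would report a timeout once the first minute had passed.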
> > > > > > > > > > > > > > > > > > -- > > > Ketan > > > > > > > > > > > > > > > > > > > > > > > > > -- > > Ketan > > > > > > > > > > > > > > -- > Ketan > > > From ketancmaheshwari at gmail.com Thu Sep 22 15:56:38 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Thu, 22 Sep 2011 15:56:38 -0500 Subject: [Swift-devel] persistent coasters gridftp provider fails with error 500 Message-ID: Hi Mihael, I was trying to use persistent coasters on OSG with gridFTP as provider. The script catsn is intended to transfer one small file (<1MB) to an OSG site (gsiftp://cit-gatekeeper.ultralight.org). However, I see that the script fails. The stdout message is: Failed to transfer wrapper log for job cat-tyxj0agk Progress: time: Thu, 22 Sep 2011 15:44:35 -0500 Failed:1 Exception in cat: Arguments: [gpfs/pads/swift/ketan/indir1file/data0000.txt] Host: CIT_CMS_T2__cit-gatekeeper.ultralight.org Directory: catsngridftp-20110922-1544-brgdcf5d/jobs/t/cat-tyxj0agk stderr.txt: stdout.txt: ---- Caused by: null Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 127 Final status: time: Thu, 22 Sep 2011 15:44:35 -0500 Failed:1 The following errors have occurred: 1. Job failed with an exit code of 127 Looking at the log it says the following error: 2011-09-22 15:44:35,716-0500 DEBUG vdl:transferwrapperlog Exception for wrapper log failure from catsngridftp-20110922-1544-brgdcf5d/info/t on CIT_CMS_T2__cit-gateke eper.ultralight.org: null Caused by: org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: Exception in getFile Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Failed to retrieve file information about /raid2/osg-data/engage/scec/swift_scratch/catsngridf tp-20110922-1544-brgdcf5d/info/t/cat-tyxj0agk-info Caused by: org.globus.ftp.exception.ServerException: Server refused performing the request. 
Custom message: Server refused MLST command (error code 1) [Nested except ion message: Custom message: Unexpected reply: 500-Command failed : globus_gridftp_server_file.c:globus_l_gfs_file_stat:389: 500-System error in stat: No such file or directory 500-A system call failed: No such file or directory 500 End.] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 500-Command failed : globus_gridftp_server_f ile.c:globus_l_gfs_file_stat:389: 500-System error in stat: No such file or directory 500-A system call failed: No such file or directory 500 End.] 2011-09-22 15:44:35,719-0500 INFO vdl:execute END_FAILURE thread=0-6-0-1 tr=cat 2011-09-22 15:44:35,720-0500 INFO vdl:execute Exception in cat: Arguments: [gpfs/pads/swift/ketan/indir1file/data0000.txt] Host: CIT_CMS_T2__cit-gatekeeper.ultralight.org Directory: catsngridftp-20110922-1544-brgdcf5d/jobs/t/cat-tyxj0agk I manually tried to transfer files of small sizes to and from pads to the site in question and it worked ok. The complete log for this run is: http://www.ci.uchicago.edu/~ketan/catsngridftp-20110922-1544-brgdcf5d.log Could you kindly study the log and give any clues as to what is going on. Regards, -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Sep 22 18:03:11 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 22 Sep 2011 16:03:11 -0700 Subject: [Swift-devel] persistent coasters gridftp provider fails with error 500 In-Reply-To: References: Message-ID: <1316732591.7765.14.camel@blabla> Does that machine have /bin/bash? The failure to transfer the wrapper log is normal, since the wrapper didn't actually run, but it is not the source of the problem. On Thu, 2011-09-22 at 15:56 -0500, Ketan Maheshwari wrote: > Hi Mihael, > > > I was trying to use persistent coasters on OSG with gridFTP as > provider. 
The script catsn is intended to transfer one small file > (<1MB) to an OSG site (gsiftp://cit-gatekeeper.ultralight.org). > > > However, I see that the script fails. The stdout message is: > > > Failed to transfer wrapper log for job cat-tyxj0agk > Progress: time: Thu, 22 Sep 2011 15:44:35 -0500 Failed:1 > Exception in cat: > Arguments: [gpfs/pads/swift/ketan/indir1file/data0000.txt] > Host: CIT_CMS_T2__cit-gatekeeper.ultralight.org > Directory: catsngridftp-20110922-1544-brgdcf5d/jobs/t/cat-tyxj0agk > stderr.txt: > > > stdout.txt: > > > ---- > > > Caused by: null > Caused by: > org.globus.cog.abstraction.impl.common.execution.JobException: Job > failed with an exit code of 127 > Final status: time: Thu, 22 Sep 2011 15:44:35 -0500 Failed:1 > The following errors have occurred: > 1. Job failed with an exit code of 127 > > > Looking at the log it says the following error: > > > 2011-09-22 15:44:35,716-0500 DEBUG vdl:transferwrapperlog Exception > for wrapper log failure from > catsngridftp-20110922-1544-brgdcf5d/info/t on CIT_CMS_T2__cit-gateke > eper.ultralight.org: null > Caused by: > org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: > Exception in getFile > Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: > Failed to retrieve file information > about /raid2/osg-data/engage/scec/swift_scratch/catsngridf > tp-20110922-1544-brgdcf5d/info/t/cat-tyxj0agk-info > Caused by: org.globus.ftp.exception.ServerException: Server refused > performing the request. Custom message: Server refused MLST command > (error code 1) [Nested except > ion message: Custom message: Unexpected reply: 500-Command failed : > globus_gridftp_server_file.c:globus_l_gfs_file_stat:389: > 500-System error in stat: No such file or directory > 500-A system call failed: No such file or directory > 500 End.] 
[Nested exception is > org.globus.ftp.exception.UnexpectedReplyCodeException: Custom > message: Unexpected reply: 500-Command failed : > globus_gridftp_server_f > ile.c:globus_l_gfs_file_stat:389: > 500-System error in stat: No such file or directory > 500-A system call failed: No such file or directory > 500 End.] > 2011-09-22 15:44:35,719-0500 INFO vdl:execute END_FAILURE > thread=0-6-0-1 tr=cat > 2011-09-22 15:44:35,720-0500 INFO vdl:execute Exception in cat: > Arguments: [gpfs/pads/swift/ketan/indir1file/data0000.txt] > Host: CIT_CMS_T2__cit-gatekeeper.ultralight.org > Directory: catsngridftp-20110922-1544-brgdcf5d/jobs/t/cat-tyxj0agk > > > > I manually tried to transfer files of small sizes to and from pads to > the site in question and it worked ok. > > > The complete log for this run is: > http://www.ci.uchicago.edu/~ketan/catsngridftp-20110922-1544-brgdcf5d.log > > > Could you kindly study the log and give any clues as to what is going > on. > > > Regards, > -- > Ketan > > > From ketancmaheshwari at gmail.com Thu Sep 22 21:10:24 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Thu, 22 Sep 2011 21:10:24 -0500 Subject: [Swift-devel] persistent coasters gridftp provider fails with error 500 In-Reply-To: <1316732591.7765.14.camel@blabla> References: <1316732591.7765.14.camel@blabla> Message-ID: On Thu, Sep 22, 2011 at 6:03 PM, Mihael Hategan wrote: > Does that machine have /bin/bash? > yes. [bridled:catsn-condor]$ globus-job-run cit-gatekeeper.ultralight.org/bin/bash -c '/bin/hostname' cithep231.ultralight.org > > The failure to transfer the wrapper log is normal, since the wrapper > didn't actually run, but it is not the source of the problem. > > On Thu, 2011-09-22 at 15:56 -0500, Ketan Maheshwari wrote: > > Hi Mihael, > > > > > > I was trying to use persistent coasters on OSG with gridFTP as > > provider. The script catsn is intended to transfer one small file > > (<1MB) to an OSG site (gsiftp://cit-gatekeeper.ultralight.org). 
> > > > > > However, I see that the script fails. The stdout message is: > > > > > > Failed to transfer wrapper log for job cat-tyxj0agk > > Progress: time: Thu, 22 Sep 2011 15:44:35 -0500 Failed:1 > > Exception in cat: > > Arguments: [gpfs/pads/swift/ketan/indir1file/data0000.txt] > > Host: CIT_CMS_T2__cit-gatekeeper.ultralight.org > > Directory: catsngridftp-20110922-1544-brgdcf5d/jobs/t/cat-tyxj0agk > > stderr.txt: > > > > > > stdout.txt: > > > > > > ---- > > > > > > Caused by: null > > Caused by: > > org.globus.cog.abstraction.impl.common.execution.JobException: Job > > failed with an exit code of 127 > > Final status: time: Thu, 22 Sep 2011 15:44:35 -0500 Failed:1 > > The following errors have occurred: > > 1. Job failed with an exit code of 127 > > > > > > Looking at the log it says the following error: > > > > > > 2011-09-22 15:44:35,716-0500 DEBUG vdl:transferwrapperlog Exception > > for wrapper log failure from > > catsngridftp-20110922-1544-brgdcf5d/info/t on CIT_CMS_T2__cit-gateke > > eper.ultralight.org: null > > Caused by: > > org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: > > Exception in getFile > > Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: > > Failed to retrieve file information > > about /raid2/osg-data/engage/scec/swift_scratch/catsngridf > > tp-20110922-1544-brgdcf5d/info/t/cat-tyxj0agk-info > > Caused by: org.globus.ftp.exception.ServerException: Server refused > > performing the request. Custom message: Server refused MLST command > > (error code 1) [Nested except > > ion message: Custom message: Unexpected reply: 500-Command failed : > > globus_gridftp_server_file.c:globus_l_gfs_file_stat:389: > > 500-System error in stat: No such file or directory > > 500-A system call failed: No such file or directory > > 500 End.] 
[Nested exception is > > org.globus.ftp.exception.UnexpectedReplyCodeException: Custom > > message: Unexpected reply: 500-Command failed : > > globus_gridftp_server_f > > ile.c:globus_l_gfs_file_stat:389: > > 500-System error in stat: No such file or directory > > 500-A system call failed: No such file or directory > > 500 End.] > > 2011-09-22 15:44:35,719-0500 INFO vdl:execute END_FAILURE > > thread=0-6-0-1 tr=cat > > 2011-09-22 15:44:35,720-0500 INFO vdl:execute Exception in cat: > > Arguments: [gpfs/pads/swift/ketan/indir1file/data0000.txt] > > Host: CIT_CMS_T2__cit-gatekeeper.ultralight.org > > Directory: catsngridftp-20110922-1544-brgdcf5d/jobs/t/cat-tyxj0agk > > > > > > > > I manually tried to transfer files of small sizes to and from pads to > > the site in question and it worked ok. > > > > > > The complete log for this run is: > > > http://www.ci.uchicago.edu/~ketan/catsngridftp-20110922-1544-brgdcf5d.log > > > > > > Could you kindly study the log and give any clues as to what is going > > on. > > > > > > Regards, > > -- > > Ketan > > > > > > > > > -- Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Sep 22 21:17:03 2011 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 22 Sep 2011 19:17:03 -0700 Subject: [Swift-devel] persistent coasters gridftp provider fails with error 500 In-Reply-To: References: <1316732591.7765.14.camel@blabla> Message-ID: <1316744223.10375.2.camel@blabla> Is your workdir on a shared filesystem? On Thu, 2011-09-22 at 21:10 -0500, Ketan Maheshwari wrote: > > On Thu, Sep 22, 2011 at 6:03 PM, Mihael Hategan > wrote: > Does that machine have /bin/bash? > > > yes. > > > [bridled:catsn-condor]$ globus-job-run > cit-gatekeeper.ultralight.org /bin/bash -c '/bin/hostname' > cithep231.ultralight.org > > > > > The failure to transfer the wrapper log is normal, since the > wrapper > didn't actually run, but it is not the source of the problem. 
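
For context on the exit code: 127 is the status a POSIX shell conventionally returns when it cannot find the command it was asked to run, which fits the observation that the wrapper did not actually run. A minimal sketch, assuming a POSIX `sh` is available locally; the function names and the nonexistent command name are made up for illustration, and nothing here reflects how Swift or the coaster service actually launches jobs.

```python
import subprocess


def run_via_shell(command):
    """Run `command` through sh and return its exit status."""
    result = subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def describe_exit(code):
    """Map a shell exit status to its conventional meaning."""
    if code == 0:
        return "success"
    if code == 126:
        return "command found but not executable"
    if code == 127:
        return "command not found"
    if 128 < code <= 255:
        return "terminated by signal %d" % (code - 128)
    return "application-defined failure"


if __name__ == "__main__":
    # Hypothetical command name, chosen so that it does not exist.
    status = run_via_shell("no-such-command-for-this-sketch")
    print(status, describe_exit(status))
```

A status outside the conventional 0 to 255 range, such as the 521 seen earlier in the coaster runs, falls into the last branch: it is defined by the software reporting it rather than by the shell.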
> > > On Thu, 2011-09-22 at 15:56 -0500, Ketan Maheshwari wrote: > > Hi Mihael, > > > > > > I was trying to use persistent coasters on OSG with gridFTP > as > > provider. The script catsn is intended to transfer one small > file > > (<1MB) to an OSG site > (gsiftp://cit-gatekeeper.ultralight.org). > > > > > > However, I see that the script fails. The stdout message is: > > > > > > Failed to transfer wrapper log for job cat-tyxj0agk > > Progress: time: Thu, 22 Sep 2011 15:44:35 -0500 Failed:1 > > Exception in cat: > > Arguments: [gpfs/pads/swift/ketan/indir1file/data0000.txt] > > Host: CIT_CMS_T2__cit-gatekeeper.ultralight.org > > Directory: > catsngridftp-20110922-1544-brgdcf5d/jobs/t/cat-tyxj0agk > > stderr.txt: > > > > > > stdout.txt: > > > > > > ---- > > > > > > Caused by: null > > Caused by: > > > org.globus.cog.abstraction.impl.common.execution.JobException: > Job > > failed with an exit code of 127 > > Final status: time: Thu, 22 Sep 2011 15:44:35 -0500 > Failed:1 > > The following errors have occurred: > > 1. Job failed with an exit code of 127 > > > > > > Looking at the log it says the following error: > > > > > > 2011-09-22 15:44:35,716-0500 DEBUG vdl:transferwrapperlog > Exception > > for wrapper log failure from > > catsngridftp-20110922-1544-brgdcf5d/info/t on > CIT_CMS_T2__cit-gateke > > eper.ultralight.org: null > > Caused by: > > > org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: > > Exception in getFile > > Caused by: > org.globus.cog.abstraction.impl.file.FileResourceException: > > Failed to retrieve file information > > about /raid2/osg-data/engage/scec/swift_scratch/catsngridf > > tp-20110922-1544-brgdcf5d/info/t/cat-tyxj0agk-info > > Caused by: org.globus.ftp.exception.ServerException: Server > refused > > performing the request. 
Custom message: Server refused MLST > command > > (error code 1) [Nested except > > ion message: Custom message: Unexpected reply: 500-Command > failed : > > globus_gridftp_server_file.c:globus_l_gfs_file_stat:389: > > 500-System error in stat: No such file or directory > > 500-A system call failed: No such file or directory > > 500 End.] [Nested exception is > > org.globus.ftp.exception.UnexpectedReplyCodeException: > Custom > > message: Unexpected reply: 500-Command failed : > > globus_gridftp_server_f > > ile.c:globus_l_gfs_file_stat:389: > > 500-System error in stat: No such file or directory > > 500-A system call failed: No such file or directory > > 500 End.] > > 2011-09-22 15:44:35,719-0500 INFO vdl:execute END_FAILURE > > thread=0-6-0-1 tr=cat > > 2011-09-22 15:44:35,720-0500 INFO vdl:execute Exception in > cat: > > Arguments: [gpfs/pads/swift/ketan/indir1file/data0000.txt] > > Host: CIT_CMS_T2__cit-gatekeeper.ultralight.org > > Directory: > catsngridftp-20110922-1544-brgdcf5d/jobs/t/cat-tyxj0agk > > > > > > > > I manually tried to transfer files of small sizes to and > from pads to > > the site in question and it worked ok. > > > > > > The complete log for this run is: > > > http://www.ci.uchicago.edu/~ketan/catsngridftp-20110922-1544-brgdcf5d.log > > > > > > Could you kindly study the log and give any clues as to what > is going > > on. > > > > > > Regards, > > -- > > Ketan > > > > > > > > > > > > > > -- > Ketan > > From ketancmaheshwari at gmail.com Thu Sep 22 22:41:36 2011 From: ketancmaheshwari at gmail.com (Ketan Maheshwari) Date: Thu, 22 Sep 2011 22:41:36 -0500 Subject: [Swift-devel] persistent coasters gridftp provider fails with error 500 In-Reply-To: <1316744223.10375.2.camel@blabla> References: <1316732591.7765.14.camel@blabla> <1316744223.10375.2.camel@blabla> Message-ID: Sorry, the previous test meant /bin/bash is on Bridled. 
This is the correct one:

[bridled:~]$ globus-job-run cit-gatekeeper.ultralight.org /bin/sh -c '/bin/bash -c /bin/hostname'
cithep231.ultralight.org

> Is your workdir on a shared filesystem?

Workdir on this site is: /raid2/osg-data/engage/scec/swift_scratch

I tried /tmp: no changes in the behavior.

--
Ketan

From wilde at mcs.anl.gov Fri Sep 23 10:15:08 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 23 Sep 2011 10:15:08 -0500 (CDT)
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <1316628542.24547.0.camel@blabla>
Message-ID: <1359547615.32943.1316790908886.JavaMail.root@zimbra.anl.gov>

Mihael, is this 0.93 deadlock (related to mapping.RootArrayDataNode) resolved or still open?

Papia, David, does the SWAT workflow now run?

- Mike

----- Original Message -----
> From: "Mihael Hategan"
> To: "Michael Wilde"
> Cc: "swift-devel Devel", "Papia Rizwan", "David Kelly"
> Sent: Wednesday, September 21, 2011 1:09:02 PM
> Subject: Re: [Swift-devel] swift 0.93 deadlock
> Not quite. But somewhat close.
>
> On Wed, 2011-09-21 at 08:39 -0500, Michael Wilde wrote:
> > Thanks, Mihael. Is this the same deadlock as the one I reported recently for 0.93?
> >
> > Sent: Tuesday, September 20, 2011 5:34:50 PM
> > Subject: [Swift-devel] New 0.93 deadlock - mapping.RootArrayDataNode?
> >
> > - Mike
> > ----- Original Message -----
> > > From: "Mihael Hategan"
> > > To: "David Kelly"
> > > Cc: "swift-devel Devel", "Papia Rizwan", "Michael Wilde"
> > > Sent: Wednesday, September 21, 2011 2:08:58 AM
> > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > Fix in r5143. Please test.
> > >
> > > On Mon, 2011-09-19 at 22:20 -0500, David Kelly wrote:
> > > > I tried today with the 0.93 update. It ran for approximately 7 hours before freezing. It looks to be happening in a different place this time.
> > > >
> > > > http://www.ci.uchicago.edu/~davidk/swat4/jstack.log
> > > > http://www.ci.uchicago.edu/~davidk/swat4/cce_ua-20110919-1955-h7t8iui2.log
> > > >
> > > > David
> > > >
> > > > ----- Original Message -----
> > > > > From: "Mihael Hategan"
> > > > > To: "Michael Wilde"
> > > > > Cc: "David Kelly", "swift-devel Devel", "Papia Rizwan"
> > > > > Sent: Saturday, September 17, 2011 11:36:25 PM
> > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > I have a tentative fix in the branch and trunk. Revisions 5123 and 5124, respectively. Please let me know how that works out.
> > > > >
> > > > > Mihael
> > > > >
> > > > > On Fri, 2011-09-16 at 11:50 -0500, Michael Wilde wrote:
> > > > > > David and Papia, can you report to the list what the status is of running the SWAT app?
> > > > > >
> > > > > > - I understand that Mihael will work on the 0.93 deadlock fix this weekend, which is great.
> > > > > >
> > > > > > - I understand that it's happening on trunk as well.
> > > > > >
> > > > > > - Papia, can you try to "perturb" the Swift code in the hopes that some equivalent but different code doesn't trip into the same bug? I.e., try a different mapper or a different variable strategy (i.e., arrays vs scalars, structs vs separate vars) just to see if you can work around this? Or, put in some shell logic to catch the hang and kill and re-run (or resume) Swift? If you just kill a hung script and then resume it, will it work? We could maybe alter the hang checker to kill Swift on its own, with a return code or message that you could use to trigger a resume.
> > > > > >
> > > > > > Mike
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "David Kelly"
> > > > > > > To: "Michael Wilde"
> > > > > > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > > > > > Sent: Thursday, September 15, 2011 4:34:02 PM
> > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > I was able to get it running on PADS with trunk. I ran into the same issue.
> > > > > > >
> > > > > > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log
> > > > > > > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log
> > > > > > >
> > > > > > > David
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > > From: "David Kelly"
> > > > > > > > To: "Michael Wilde"
> > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > > > > > > Sent: Thursday, September 15, 2011 2:39:47 PM
> > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using passive persistent coasters. Is there a way to use automatic coasters on the MCS workstations?
> > > > > > > > I'll try copying this over to PADS and running there to see if I can reproduce it.
> > > > > > > >
> > > > > > > > David
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Michael Wilde"
> > > > > > > > > To: "David Kelly"
> > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > Sent: Thursday, September 15, 2011 2:18:17 PM
> > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > Can you make SWAT run under trunk, as Papia is testing using standard auto coasters, and doesn't need any of the missing coaster-service options.
> > > > > > > > >
> > > > > > > > > - Mike
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > > From: "David Kelly"
> > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > Sent: Thursday, September 15, 2011 2:15:36 PM
> > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > I got past the compilation errors by renaming all the functions with capitalization, but ran into an issue with coaster-service. Last week I noticed coaster-service was missing options for dynamic ports. I found today that it is also missing -passive. I'll try to track down where this changed and restore the previous version.
> > > > > > > > > >
> > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > > To: "David Kelly"
> > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM
> > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > Excellent, thanks - that's good. I also just verified that Papia is not using the overAllocation tags in the sites file, so this problem is clearly a Java deadlock and has nothing to do with the scheduling problem that the (now fixed) overAllocation problem was causing.
> > > > > > > > > > >
> > > > > > > > > > > My understanding is that this SWAT script is failing under trunk because of the recent token case handling issue (I think the camel-case one). Can you work with Papia to see if either that issue is now fixed, or if her script can be changed to avoid that, so that you can both test the SWAT script with trunk, to see if the deadlock still occurs?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > - Mike
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM
> > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > I narrowed down the problem a bit. Last night I ran jstack on the wrong java process which is why it didn't report a deadlock.
> > > > > > > > > > > >
> > > > > > > > > > > > Papia and I are seeing the same issue.
> > > > > > > > > > > >
> > > > > > > > > > > > My jstack: http://www.ci.uchicago.edu/~davidk/swat2/jstack.log
> > > > > > > > > > > > Papia's jstack: http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log
> > > > > > > > > > > >
> > > > > > > > > > > > It happens in the same place:
> > > > > > > > > > > >
> > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100)
> > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24)
> > > > > > > > > > > >
> > > > > > > > > > > > Filed as bug #559
> > > > > > > > > > > >
> > > > > > > > > > > > David
> > > > > > > > > > > >
> > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > > > > To: "David Kelly"
> > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM
> > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > David, it sounds like more analysis is needed here. If the SWAT runs are not showing a deadlock (but your runs are) then likely we have two different problems here.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Another case we saw in 0.93 with scripts failing to progress is due to the overAllocation parameter problem that Mihael fixed yesterday. The symptom there is that Swift starts a coaster with a time slot too small for the apps in the script, and no apps wind up running. I think that situation in general merits a separate ticket, and may have been discussed on swift-devel (but quite a while ago).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Can you determine if indeed Papia's SWAT runs are hanging for a reason other than a Java deadlock?
> > > > > > > > > > > > >
> > > > > > > > > > > > > - Mike
> > > > > > > > > > > > >
> > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM
> > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > The jstack log corresponds to the most recent log file - http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. jstack does not report any deadlocks, but I thought it might be useful so I included it. Swift was not making any progress for about 5 hours before I sent the logs. I am running the latest 0.93 branch. I will try again today.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > David
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > > > > > > To: "David Kelly"
> > > > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM
> > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > > David, which of the many Swift logs in that /swat dir does the jstack.log pertain to? How many of these runs deadlocked?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > And, did you verify that you (and Papia) are running on the latest rev of the 0.93 branch?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > - Mike
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > > > > > > To: "Mihael Hategan"
> > > > > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Michael Wilde"
> > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 PM
> > > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > > > I was able to reproduce the problem with persistent coasters on the MCS servers.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The jstack output is at http://www.ci.uchicago.edu/~davidk/swat/jstack.log
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The full collection of logs are at http://www.ci.uchicago.edu/~davidk/swat.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > David
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > > From: "Mihael Hategan"
> > > > > > > > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 10:30:48 PM
> > > > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > > > > Could you also forward the attachments please?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Mihael
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael Wilde wrote:
> > > > > > > > > > > > > > > > > > I think I am seeing a similar deadlock on 0.93 in the ParVis script, and am trying to get a clean log and jstack to confirm.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > As far as I can tell, Papia is running the correct 0.93 code, but please verify.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > David will try to replicate this problem as well.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > - Mike
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > > > > From: "Papia Rizwan"
> > > > > > > > > > > > > > > > > > > To: "swift-devel Devel", "Michael Wilde", "Michael P. Shields"
> > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM
> > > > > > > > > > > > > > > > > > > Subject: swift 0.93 deadlock
> > > > > > > > > > > > > > > > > > > Attached are the jstack output and the log file.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > Papia Rizwan
> > > > > > > > > > > > > > > > > _______________________________________________
> > > > > > > > > > > > > > > > > Swift-devel mailing list
> > > > > > > > > > > > > > > > > Swift-devel at ci.uchicago.edu
> > > > > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Michael Wilde
> > > > > > > > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > > > > > > > Argonne National Laboratory

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Fri Sep 23 10:56:48 2011
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 23 Sep 2011 10:56:48 -0500 (CDT)
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <1359547615.32943.1316790908886.JavaMail.root@zimbra.anl.gov>
Message-ID: <54408424.33302.1316793408665.JavaMail.root@zimbra.anl.gov>

----- Original Message -----
> From: "Michael Wilde"
> To: "Mihael Hategan", "Papia Rizwan", "David Kelly"
> Cc: "swift-devel Devel"
> Sent: Friday, September 23, 2011 10:15:08 AM
> Subject: Re: [Swift-devel] swift 0.93 deadlock
> Mihael, is this 0.93 deadlock (related to mapping.RootArrayDataNode) resolved or still open?

Ok, I just picked up 5143 in my 0.93 build on Fusion. I'll keep an eye out to see if my not-infrequent deadlock goes away. I see that it's in the same code region that I seem to be deadlocking in.

When you said earlier in this thread "Not quite. But somewhat close." - do you suspect yet another deadlock in the same code, or do you believe that your fix addresses the one I reported in addition to the ones that David and Papia were encountering?

- Mike

>
> Papia, David, does the SWAT workflow now run?
>
> - Mike
>
> ----- Original Message -----
> > From: "Mihael Hategan"
> > To: "Michael Wilde"
> > Cc: "swift-devel Devel", "Papia Rizwan", "David Kelly"
> > Sent: Wednesday, September 21, 2011 1:09:02 PM
> > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > Not quite. But somewhat close.
> >
> > On Wed, 2011-09-21 at 08:39 -0500, Michael Wilde wrote:
> > > Thanks, Mihael. Is this the same deadlock as the one I reported recently for 0.93?
> > >
> > > Sent: Tuesday, September 20, 2011 5:34:50 PM
> > > Subject: [Swift-devel] New 0.93 deadlock - mapping.RootArrayDataNode?
> > >
> > > - Mike
> > > ----- Original Message -----
> > > > From: "Mihael Hategan"
> > > > To: "David Kelly"
> > > > Cc: "swift-devel Devel", "Papia Rizwan", "Michael Wilde"
> > > > Sent: Wednesday, September 21, 2011 2:08:58 AM
> > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > Fix in r5143. Please test.
> > > >
> > > > On Mon, 2011-09-19 at 22:20 -0500, David Kelly wrote:
> > > > > I tried today with the 0.93 update. It ran for approximately 7 hours before freezing. It looks to be happening in a different place this time.
> > > > >
> > > > > http://www.ci.uchicago.edu/~davidk/swat4/jstack.log
> > > > > http://www.ci.uchicago.edu/~davidk/swat4/cce_ua-20110919-1955-h7t8iui2.log
> > > > >
> > > > > David
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Mihael Hategan"
> > > > > > To: "Michael Wilde"
> > > > > > Cc: "David Kelly", "swift-devel Devel", "Papia Rizwan"
> > > > > > Sent: Saturday, September 17, 2011 11:36:25 PM
> > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > I have a tentative fix in the branch and trunk. Revisions 5123 and 5124, respectively. Please let me know how that works out.
> > > > > >
> > > > > > Mihael
> > > > > >
> > > > > > On Fri, 2011-09-16 at 11:50 -0500, Michael Wilde wrote:
> > > > > > > David and Papia, can you report to the list what the status is of running the SWAT app?
> > > > > > >
> > > > > > > - I understand that Mihael will work on the 0.93 deadlock fix this weekend, which is great.
> > > > > > >
> > > > > > > - I understand that it's happening on trunk as well.
> > > > > > >
> > > > > > > - Papia, can you try to "perturb" the Swift code in the hopes that some equivalent but different code doesn't trip into the same bug? I.e., try a different mapper or a different variable strategy (arrays vs. scalars, structs vs. separate vars) just to see if you can work around this. Or put in some shell logic to catch the hang and kill and re-run (or resume) Swift. If you just kill a hung script and then resume it, will it work? We could maybe alter the hang checker to kill Swift on its own, with a return code or message that you could use to trigger a resume.
> > > > > > >
> > > > > > > Mike
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > > From: "David Kelly"
> > > > > > > > To: "Michael Wilde"
> > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > > > > > > Sent: Thursday, September 15, 2011 4:34:02 PM
> > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > I was able to get it running on PADS with trunk. I ran into the same issue.
> > > > > > > >
> > > > > > > > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log
> > > > > > > > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log
> > > > > > > >
> > > > > > > > David
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > > From: "David Kelly"
> > > > > > > > > To: "Michael Wilde"
> > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > > > > > > > Sent: Thursday, September 15, 2011 2:39:47 PM
> > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using passive persistent coasters. Is there a way to use automatic coasters on the MCS workstations? I'll try copying this over to PADS and running there to see if I can reproduce it.
> > > > > > > > >
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > To: "David Kelly"
> > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > Sent: Thursday, September 15, 2011 2:18:17 PM
> > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > Can you make SWAT run under trunk? Papia is testing using standard auto coasters and doesn't need any of the missing coaster-service options.
> > > > > > > > > >
> > > > > > > > > > - Mike
> > > > > > > > > >
> > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > Sent: Thursday, September 15, 2011 2:15:36 PM
> > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > I got past the compilation errors by renaming all the functions with capitalization, but ran into an issue with coaster-service. Last week I noticed coaster-service was missing options for dynamic ports. I found today that it is also missing -passive. I'll try to track down where this changed and restore the previous version.
> > > > > > > > > > >
> > > > > > > > > > > David
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > > > To: "David Kelly"
> > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM
> > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > Excellent, thanks - that's good.
> > > > > > > > > > > > I also just verified that Papia is not using the overAllocation tags in the sites file, so this problem is clearly a Java deadlock and has nothing to do with the scheduling problem that the (now fixed) overAllocation problem was causing.
> > > > > > > > > > > >
> > > > > > > > > > > > My understanding is that this SWAT script is failing under trunk because of the recent token case handling issue (I think the camel-case one). Can you work with Papia to see if either that issue is now fixed, or if her script can be changed to avoid that, so that you can both test the SWAT script with trunk, to see if the deadlock still occurs?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > - Mike
> > > > > > > > > > > >
> > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM
> > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > I narrowed down the problem a bit. Last night I ran jstack on the wrong Java process, which is why it didn't report a deadlock.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Papia and I are seeing the same issue.
> > > > > > > > > > > > >
> > > > > > > > > > > > > My jstack:
> > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/jstack.log
> > > > > > > > > > > > > Papia's jstack:
> > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log
> > > > > > > > > > > > >
> > > > > > > > > > > > > It happens in the same place:
> > > > > > > > > > > > >
> > > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100)
> > > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Filed as bug #559
> > > > > > > > > > > > >
> > > > > > > > > > > > > David
> > > > > > > > > > > > >
> > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > > > > > To: "David Kelly"
> > > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM
> > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > David, it sounds like more analysis is needed here. If the SWAT runs are not showing a deadlock (but your runs are) then likely we have two different problems here.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Another case we saw in 0.93 with scripts failing to progress is due to the overAllocation parameter problem that Mihael fixed yesterday. The symptom there is that Swift starts a coaster with a time slot too small for the apps in the script, and no apps wind up running. I think that situation in general merits a separate ticket, and may have been discussed on swift-devel (but quite a while ago).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Can you determine if indeed Papia's SWAT runs are hanging for a reason other than a Java deadlock?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - Mike
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM
> > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > > The jstack log corresponds to the most recent log file - http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. jstack does not report any deadlocks, but I thought it might be useful so I included it. Swift was not making any progress for about 5 hours before I sent the logs. I am running the latest 0.93 branch. I will try again today.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > David
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > From: "Michael Wilde"
> > > > > > > > > > > > > > > > To: "David Kelly"
> > > > > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Mihael Hategan"
> > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM
> > > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > > > David, which of the many Swift logs in that /swat dir does the jstack.log pertain to? How many of these runs deadlocked?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > And, did you verify that you (and Papia) are running on the latest rev of the 0.93 branch?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > - Mike
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > > From: "David Kelly"
> > > > > > > > > > > > > > > > > To: "Mihael Hategan"
> > > > > > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan", "Michael Wilde"
> > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 PM
> > > > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > > > > I was able to reproduce the problem with persistent coasters on the MCS servers.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The jstack output is at
> > > > > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The full collection of logs is at
> > > > > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > David
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > > > From: "Mihael Hategan"
> > > > > > > > > > > > > > > > > > To: "Michael Wilde"
> > > > > > > > > > > > > > > > > > Cc: "swift-devel Devel", "Papia Rizwan"
> > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 10:30:48 PM
> > > > > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > > > > > > > Could you also forward the attachments please?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Mihael
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael Wilde wrote:
> > > > > > > > > > > > > > > > > > > I think I am seeing a similar deadlock on 0.93 in the ParVis script, and am trying to get a clean log and jstack to confirm.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > As far as I can tell, Papia is running the correct 0.93 code, but please verify.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > David will try to replicate this problem as well.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > - Mike
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > > > > > > > From: "Papia Rizwan"
> > > > > > > > > > > > > > > > > > > > To: "swift-devel Devel", "Michael Wilde", "Michael P. Shields"
> > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM
> > > > > > > > > > > > > > > > > > > > Subject: swift 0.93 deadlock
> > > > > > > > > > > > > > > > > > > > Attached are the jstack output and the log file.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > Papia Rizwan

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Fri Sep 23 12:44:53 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 23 Sep 2011 10:44:53 -0700
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <54408424.33302.1316793408665.JavaMail.root@zimbra.anl.gov>
References: <54408424.33302.1316793408665.JavaMail.root@zimbra.anl.gov>
Message-ID: <1316799893.14551.1.camel@blabla>

On Fri, 2011-09-23 at 10:56 -0500, Michael Wilde wrote:
> Ok, I just picked up 5143 in my 0.93 build on Fusion. I'll keep an
> eye out to see if my not-infrequent deadlock goes away. I see that it's
> in the same code region that I seem to be deadlocking in.
>
> When you said earlier in this thread "Not quite. But somewhat
> close." - do you suspect yet another deadlock in the same code, or do
> you believe that your fix addresses the one I reported in addition to
> the ones that David and Papia were encountering?

r5143 is a fix specifically for this deadlock.

From hategan at mcs.anl.gov Fri Sep 23 12:46:25 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 23 Sep 2011 10:46:25 -0700
Subject: [Swift-devel] persistent coasters gridftp provider fails with error 500
In-Reply-To:
References: <1316732591.7765.14.camel@blabla> <1316744223.10375.2.camel@blabla>
Message-ID: <1316799985.14551.3.camel@blabla>

On Thu, 2011-09-22 at 22:41 -0500, Ketan Maheshwari wrote:
> Sorry, the previous test meant /bin/bash is on Bridled. This is the
> correct one:
>
> [bridled:~]$ globus-job-run cit-gatekeeper.ultralight.org /bin/sh -c '/bin/bash -c /bin/hostname'
> cithep231.ultralight.org
>
> > Is your workdir on a shared filesystem?
>
>
> Workdir on this site is : /raid2/osg-data/engage/scec/swift_scratch
>
>
> I tried /tmp: no changes in the behavior.

Is either /raid2 or /tmp a shared filesystem visible to both the compute nodes and the head node?

From davidk at ci.uchicago.edu Fri Sep 23 14:05:29 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Fri, 23 Sep 2011 14:05:29 -0500 (CDT)
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <1359547615.32943.1316790908886.JavaMail.root@zimbra.anl.gov>
Message-ID: <1914388243.117722.1316804729039.JavaMail.root@zimbra-mb2.anl.gov>

I don't know if the workflow has ever run completely. Since the latest patches on Wednesday I've tried running it three times. The first time it ran for 20 hours but ran into an issue with coaster workers timing out and eventually stopped processing work. The second, started last night, was terminated early this morning when the MCS machine I was running on was rebooted. The third I started this morning. It is currently running and I will email the status as soon as it either fails or finishes.

David

----- Original Message -----
> From: "Michael Wilde"
> To: "Mihael Hategan", "Papia Rizwan", "David Kelly"
> Cc: "swift-devel Devel"
> Sent: Friday, September 23, 2011 10:15:08 AM
> Subject: Re: [Swift-devel] swift 0.93 deadlock
> Mihael, is this 0.93 deadlock (related to mapping.RootArrayDataNode) resolved or still open?
>
> Papia, David, does the SWAT workflow now run?
>
> - Mike
>
> ----- Original Message -----
> > From: "Mihael Hategan"
> > To: "Michael Wilde"
> > Cc: "swift-devel Devel", "Papia Rizwan", "David Kelly"
> > Sent: Wednesday, September 21, 2011 1:09:02 PM
> > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > Not quite. But somewhat close.
> >
> > On Wed, 2011-09-21 at 08:39 -0500, Michael Wilde wrote:
> > > Thanks, Mihael. Is this the same deadlock as the one I reported recently for 0.93?
> > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > > > - MIke > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > , > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > Rizwan" , "Mihael > > > > > > > > > > > > > Hategan" > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock > > > > > > > > > > > > > I narrowed down the problem a bit. Last night > > > > > > > > > > > > > I > > > > > > > > > > > > > ran > > > > > > > > > > > > > jstack > > > > > > > > > > > > > on > > > > > > > > > > > > > the > > > > > > > > > > > > > wrong java process which is why it didn't > > > > > > > > > > > > > report > > > > > > > > > > > > > a > > > > > > > > > > > > > deadlock. > > > > > > > > > > > > > > > > > > > > > > > > > > Papia and I are seeing the same issue. 
> > > > > > > > > > > > > > > > > > > > > > > > > > My jstack: > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/jstack.log > > > > > > > > > > > > > Papia's jstack: > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log > > > > > > > > > > > > > > > > > > > > > > > > > > It happens in the same place: > > > > > > > > > > > > > > > > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100) > > > > > > > > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24) > > > > > > > > > > > > > > > > > > > > > > > > > > Filed as bug #559 > > > > > > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > From: "Michael Wilde" > > > > > > > > > > > > > > To: "David Kelly" > > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > > , > > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > > Rizwan" , "Mihael > > > > > > > > > > > > > > Hategan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 > > > > > > > > > > > > > > AM > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 > > > > > > > > > > > > > > deadlock > > > > > > > > > > > > > > David, it sounds like more analysis is > > > > > > > > > > > > > > needed > > > > > > > > > > > > > > here. If > > > > > > > > > > > > > > the > > > > > > > > > > > > > > SWAT > > > > > > > > > > > > > > runs > > > > > > > > > > > > > > are not showing a deadlock (but your runs > > > > > > > > > > > > > > are) > > > > > > > > > > > > > > then > > > > > > > > > > > > > > likely > > > > > > > > > > > > > > we > > > > > > > > > > > > > > have > > > > > > > > > > > > > > two > > > > > > > > > > > > > > different problems here. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Another case we saw in 0.93 with scripts > > > > > > > > > > > > > > failing > > > > > > > > > > > > > > to > > > > > > > > > > > > > > progress > > > > > > > > > > > > > > is > > > > > > > > > > > > > > due > > > > > > > > > > > > > > to > > > > > > > > > > > > > > the overAllocation parameter problem that > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > fixed > > > > > > > > > > > > > > yesterday. > > > > > > > > > > > > > > The > > > > > > > > > > > > > > symptom there is that Swift starts a coaster > > > > > > > > > > > > > > with > > > > > > > > > > > > > > a > > > > > > > > > > > > > > time > > > > > > > > > > > > > > slot > > > > > > > > > > > > > > too > > > > > > > > > > > > > > small for the apps in the script, and no > > > > > > > > > > > > > > apps > > > > > > > > > > > > > > wind > > > > > > > > > > > > > > up > > > > > > > > > > > > > > running. > > > > > > > > > > > > > > I > > > > > > > > > > > > > > think > > > > > > > > > > > > > > that situation in general merits a separate > > > > > > > > > > > > > > ticket, > > > > > > > > > > > > > > and > > > > > > > > > > > > > > may > > > > > > > > > > > > > > have > > > > > > > > > > > > > > been > > > > > > > > > > > > > > discussed on swift-devel (but quite a while > > > > > > > > > > > > > > ago). > > > > > > > > > > > > > > > > > > > > > > > > > > > > Can you determine if indeed Papia's SWAT > > > > > > > > > > > > > > runs > > > > > > > > > > > > > > are > > > > > > > > > > > > > > hanging > > > > > > > > > > > > > > for > > > > > > > > > > > > > > a > > > > > > > > > > > > > > reason > > > > > > > > > > > > > > other than a Java deadlock? 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > > > , > > > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > > > Rizwan" , "Mihael > > > > > > > > > > > > > > > Hategan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 > > > > > > > > > > > > > > > AM > > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 > > > > > > > > > > > > > > > deadlock > > > > > > > > > > > > > > > The jstack log corresponds to the most > > > > > > > > > > > > > > > recent > > > > > > > > > > > > > > > log > > > > > > > > > > > > > > > file - > > > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log. > > > > > > > > > > > > > > > jstack does not report any deadlocks, but > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > thought > > > > > > > > > > > > > > > it > > > > > > > > > > > > > > > might > > > > > > > > > > > > > > > be > > > > > > > > > > > > > > > useful > > > > > > > > > > > > > > > so I included it. Swift was not making any > > > > > > > > > > > > > > > progress > > > > > > > > > > > > > > > for > > > > > > > > > > > > > > > about > > > > > > > > > > > > > > > 5 > > > > > > > > > > > > > > > hours > > > > > > > > > > > > > > > before I sent the logs. I am running the > > > > > > > > > > > > > > > latest > > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > > branch. > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > will > > > > > > > > > > > > > > > try again today. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > > > From: "Michael Wilde" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "David Kelly" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > > > > , > > > > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > > > > Rizwan" , > > > > > > > > > > > > > > > > "Mihael > > > > > > > > > > > > > > > > Hategan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Thursday, September 15, 2011 > > > > > > > > > > > > > > > > 5:54:11 > > > > > > > > > > > > > > > > AM > > > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 > > > > > > > > > > > > > > > > deadlock > > > > > > > > > > > > > > > > David, which of the many Swift logs in > > > > > > > > > > > > > > > > that > > > > > > > > > > > > > > > > /swat > > > > > > > > > > > > > > > > dir > > > > > > > > > > > > > > > > does > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > jstack.log pertain to? How many of these > > > > > > > > > > > > > > > > runs > > > > > > > > > > > > > > > > deadlocked? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > And, did you verify that you (and Papia) > > > > > > > > > > > > > > > > are > > > > > > > > > > > > > > > > running > > > > > > > > > > > > > > > > on > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > latest > > > > > > > > > > > > > > > > rev > > > > > > > > > > > > > > > > of the 0.93 branch? 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > > > > From: "David Kelly" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > > > > > , > > > > > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > > > > > Rizwan" , > > > > > > > > > > > > > > > > > "Michael > > > > > > > > > > > > > > > > > Wilde" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 > > > > > > > > > > > > > > > > > 11:04:41 > > > > > > > > > > > > > > > > > PM > > > > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 > > > > > > > > > > > > > > > > > deadlock > > > > > > > > > > > > > > > > > I was able to reproduce the problem > > > > > > > > > > > > > > > > > with > > > > > > > > > > > > > > > > > persistent > > > > > > > > > > > > > > > > > coasters > > > > > > > > > > > > > > > > > on > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > MCS servers. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The jstack output is at > > > > > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The full collection of logs are at > > > > > > > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > David > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > > > > > From: "Mihael Hategan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "Michael Wilde" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Cc: "swift-devel Devel" > > > > > > > > > > > > > > > > > > , > > > > > > > > > > > > > > > > > > "Papia > > > > > > > > > > > > > > > > > > Rizwan" > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 > > > > > > > > > > > > > > > > > > 10:30:48 > > > > > > > > > > > > > > > > > > PM > > > > > > > > > > > > > > > > > > Subject: Re: [Swift-devel] swift > > > > > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > > > > > deadlock > > > > > > > > > > > > > > > > > > Could you also forward the > > > > > > > > > > > > > > > > > > attachments > > > > > > > > > > > > > > > > > > please? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, > > > > > > > > > > > > > > > > > > Michael > > > > > > > > > > > > > > > > > > Wilde > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > I think I am seeing a similar > > > > > > > > > > > > > > > > > > > deadlock > > > > > > > > > > > > > > > > > > > on > > > > > > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > > > > > > in > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > ParVis > > > > > > > > > > > > > > > > > > > script, > > > > > > > > > > > > > > > > > > > and am trying to get a clean log > > > > > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > > > > jstack > > > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > > > confirm. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > As far as I can tell, Papia is > > > > > > > > > > > > > > > > > > > running > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > correct > > > > > > > > > > > > > > > > > > > 0.93 > > > > > > > > > > > > > > > > > > > code, > > > > > > > > > > > > > > > > > > > but > > > > > > > > > > > > > > > > > > > please verify. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > David will try to replicate this > > > > > > > > > > > > > > > > > > > problem > > > > > > > > > > > > > > > > > > > as > > > > > > > > > > > > > > > > > > > well. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Mike > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > > > > > > > > > > From: "Papia Rizwan" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "swift-devel Devel" > > > > > > > > > > > > > > > > > > > > , > > > > > > > > > > > > > > > > > > > > "Michael > > > > > > > > > > > > > > > > > > > > Wilde" , > > > > > > > > > > > > > > > > > > > > "Michael > > > > > > > > > > > > > > > > > > > > P. > > > > > > > > > > > > > > > > > > > > Shields" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, September 14, > > > > > > > > > > > > > > > > > > > > 2011 > > > > > > > > > > > > > > > > > > > > 1:56:13 PM > > > > > > > > > > > > > > > > > > > > Subject: swift 0.93 deadlock > > > > > > > > > > > > > > > > > > > > Attached are the jstack output > > > > > > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > > log > > > > > > > > > > > > > > > > > > > > file. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > > Papia Rizwan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > > > > > > Swift-devel mailing list > > > > > > > > > > > > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > Michael Wilde > > > > > > > > > > > > > > > > Computation Institute, University of > > > > > > > > > > > > > > > > Chicago > > > > > > > > > > > > > > > > Mathematics and Computer Science > > > > > > > > > > > > > > > > Division > > > > > > > > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > Michael Wilde > > > > > > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > Michael Wilde > > > > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Michael Wilde > > > > > > > > > > Computation Institute, University of Chicago > > > > > > > > > > Mathematics and Computer Science Division > > > > > > > > > > Argonne National Laboratory > > > > > > > > > _______________________________________________ > > > > > > > > > Swift-devel mailing list > > > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > 
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > > > > > > >
>
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

From hategan at mcs.anl.gov Fri Sep 23 14:51:41 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 23 Sep 2011 12:51:41 -0700
Subject: [Swift-devel] swift 0.93 deadlock
In-Reply-To: <1914388243.117722.1316804729039.JavaMail.root@zimbra-mb2.anl.gov>
References: <1914388243.117722.1316804729039.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1316807501.16609.1.camel@blabla>

On Fri, 2011-09-23 at 14:05 -0500, David Kelly wrote:
> I don't know if the workflow has ever run completely. Since the latest patches on Wednesday I've tried running it three times.
>
> The first time it ran for 20 hours but ran into an issue with coaster workers timing out and eventually stopped processing work.

Did you use provider staging?

From jonmon at mcs.anl.gov Fri Sep 23 20:28:04 2011
From: jonmon at mcs.anl.gov (Jonathan Monette)
Date: Fri, 23 Sep 2011 20:28:04 -0500
Subject: [Swift-devel] Notes for log processing
In-Reply-To:
References: <2054923131.29369.1316703382158.JavaMail.root@zimbra.anl.gov>
Message-ID: <09450683-D237-4AEB-AF10-FD08DA7CA094@mcs.anl.gov>

In Mike's example... how were the *96.data files made? Did you mean maybe *.norm?

On Sep 22, 2011, at 11:05 AM, Justin M Wozniak wrote:

>
> An asciidoc-rendered version of the plotter README is up at:
>
> http://www.mcs.anl.gov/~wozniak/plotter-readme.html
>
> On Thu, 22 Sep 2011, Michael Wilde wrote:
>
>> I wanted to do some basic plots to better understand the performance and bottlenecks of a Swift script. I started with this doc on log processing:
>>
>> http://www.ci.uchicago.edu/swift/wwwdev/guides/trunk/userguide/userguide.html#_log_processing
>>
>> Here's what I had to do to generate a load plot of my Swift run.
>> Hopefully this will help in documenting how to use the new Java plotting configs now, as well as packaging them for easy use with higher level scripts.
>>
>> - Mike
>>
>>
>> # Point to the log processing tools
>>
>> lp=/homes/wilde/swift/src/0.93/cog/modules/swift/dist/swift-svn/libexec/log-processing
>>
>> # fix normalize-log.pl: FH -> START
>>
>> # find the start time of the log in decimal unixtime
>>
>> $ $lp/normalize-log.pl stime amwg_stats-20110922-0541-ii4a5z96.iso >amwg_stats-20110922-0541-ii4a5z96.norm
>>
>> # convert ISO time to UNIX time in decimal seconds
>>
>> $ $lp/iso-to-secs amwg_stats-20110922-0541-ii4a5z96.iso
>>
>> # Normalize the log to start at time 0.0
>>
>> $ $lp/normalize-log.pl stime amwg_stats-20110922-0541-ii4a5z96.iso >amwg_stats-20110922-0541-ii4a5z96.norm
>>
>> This gives, eg:
>>
>> 0 DEBUG Loader arguments: [-config, cf.properties, ...
>> 0.00199985504150391 DEBUG Loader Max heap: 257294336
>> 0.0199999809265137 DEBUG textfiles BEGIN CDM FILE:
>> 0.0199999809265137 DEBUG textfiles END CDM FILE:
>> 0.622999906539917 DEBUG textfiles BEGIN SWIFTSCRIPT:
>> 0.623999834060669 DEBUG textfiles END SWIFTSCRIPT:
>>
>> # Install the new plotter tools
>>
>> $ svn co https://svn.ci.uchicago.edu/svn/vdl2/usertools/plotter
>> U plotter
>> Checked out revision 5151.
>> fusion$ cd plotter
>> fusion$ ant
>> Buildfile: build.xml
>>
>> compile:
>> [javac] Compiling 7 source files to /fusion/gpfs/home/wilde/amwg/run01/plotter/build
>>
>> jar:
>> [jar] Building jar: /fusion/gpfs/home/wilde/amwg/run01/plotter/lib/plotter.jar
>>
>> BUILD SUCCESSFUL
>> Total time: 2 seconds
>>
>>
>> # Edit load.cfg (or change your plot files to match it)
>>
>> # cp $lp/load.cfg .
>> # then edit to contain:
>>
>> xlabel = time
>> ylabel = load
>>
>> shape.amwg_stats-20110922-0541-ii4a5z96.load.data = none
>> label.amwg_stats-20110922-0541-ii4a5z96.load.data = load
>>
>> # (was eg: label.load.data = load)
>>
>> # Generate a Load plot:
>>
>> $ ./plotter/swift_plotter.zsh -s ./load.cfg load.eps *96.load.data
>>
>
> --
> Justin M Wozniak
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From wozniak at mcs.anl.gov Mon Sep 26 11:23:30 2011
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 26 Sep 2011 11:23:30 -0500 (Central Daylight Time)
Subject: [Swift-devel] Swift input data: @java and string interpolation bug
Message-ID:

I just committed a fix to @java() wrt the DSHandle.toString() change.
Today, I used it for something like:

string s = @java("java.lang.System", "getProperty", "user.dir");
tracef("s: %s\n", s);
string t = @java("java.lang.System", "getenv", "PATH");
tracef("t: %s\n", t);

which both do what one would expect.

I was also looking at string interpolation: the current behavior for:

string v = "hi";
string u = "{v}";

results in:

Execution failed:
Variable not found: v

I would like to look at this further but my first thought is that maybe
Swift should not use Karajan directly.

Justin

--
Justin M Wozniak

From hategan at mcs.anl.gov Mon Sep 26 14:03:47 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Sep 2011 12:03:47 -0700
Subject: [Swift-devel] Swift input data: @java and string interpolation bug
In-Reply-To:
References:
Message-ID: <1317063827.12024.1.camel@blabla>

On Mon, 2011-09-26 at 11:23 -0500, Justin M Wozniak wrote:

> string v = "hi";
> string u = "{v}";
>
> results in:
>
> Execution failed:
> Variable not found: v

The compiler should escape "{" I think.
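One way to read the suggestion that the compiler should escape "{" is sketched below. This is Python, not the Swift compiler or Karajan code, and the backslash escape convention and both function names (escape_braces, interpolate) are assumptions made up for illustration; the real escape syntax of the underlying layer may differ. The idea is only that a literal is escaped before it reaches whatever layer performs interpolation, so "{v}" survives as plain text instead of triggering a variable lookup.

```python
def escape_braces(literal: str) -> str:
    """Escape braces (and backslashes) in a string literal.

    Assumption: the downstream interpolator honors backslash escapes.
    """
    out = []
    for ch in literal:
        if ch in "{}\\":
            out.append("\\")  # prefix special characters with a backslash
        out.append(ch)
    return "".join(out)


def interpolate(template: str, variables: dict) -> str:
    """Toy stand-in for the interpolating layer: expands {name} references."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "\\" and i + 1 < len(template):
            out.append(template[i + 1])  # escaped character is copied literally
            i += 2
        elif ch == "{":
            end = template.index("}", i)  # ValueError if the brace is unterminated
            name = template[i + 1:end]
            if name not in variables:
                raise KeyError(f"Variable not found: {name}")
            out.append(str(variables[name]))
            i = end + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

With this toy model, passing "{v}" straight through fails with a "Variable not found" error when the lower layer does not know v, while interpolate(escape_braces("{v}"), {}) returns the text {v} unchanged. Whether Swift should instead resolve v itself is the separate design question raised in the thread.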
From wozniak at mcs.anl.gov Mon Sep 26 14:13:49 2011
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 26 Sep 2011 14:13:49 -0500 (Central Daylight Time)
Subject: [Swift-devel] Swift input data: @java and string interpolation bug
In-Reply-To: <1317063827.12024.1.camel@blabla>
References: <1317063827.12024.1.camel@blabla>
Message-ID:

On Mon, 26 Sep 2011, Mihael Hategan wrote:

> On Mon, 2011-09-26 at 11:23 -0500, Justin M Wozniak wrote:
>
>> string v = "hi";
>> string u = "{v}";
>>
>> results in:
>>
>> Execution failed:
>> Variable not found: v
>
> The compiler should escape "{" I think.

So should we try to repair this or use a different approach? (I imagine
users have seen this.)

--
Justin M Wozniak

From hategan at mcs.anl.gov Mon Sep 26 17:22:54 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Sep 2011 15:22:54 -0700
Subject: [Swift-devel] Swift input data: @java and string interpolation bug
In-Reply-To:
References: <1317063827.12024.1.camel@blabla>
Message-ID: <1317075774.13959.5.camel@blabla>

On Mon, 2011-09-26 at 14:13 -0500, Justin M Wozniak wrote:
> >
> >> string v = "hi";
> >> string u = "{v}";
> >>
> >> results in:
> >>
> >> Execution failed:
> >> Variable not found: v
> >
> > The compiler should escape "{" I think.
>
> So should we try to repair this or use a different approach? (I imagine
> users have seen this.)
>

I think that if we are to properly do this, then the intermediate XML
should reflect this rather than silently relying on the underlying
implementation.

Whether we're going to use the same syntax is a different story.
But I believe that before we do this, we should discuss the following:
- how are various data types going to be converted to string
- the exact syntax that we want
- whether we're going to support format specifiers (for ints and floats)

Mihael

From wozniak at mcs.anl.gov Mon Sep 26 17:38:39 2011
From: wozniak at mcs.anl.gov (Justin M Wozniak)
Date: Mon, 26 Sep 2011 17:38:39 -0500 (Central Daylight Time)
Subject: [Swift-devel] Swift input data: @java and string interpolation bug
In-Reply-To: <1317075774.13959.5.camel@blabla>
References: <1317063827.12024.1.camel@blabla> <1317075774.13959.5.camel@blabla>
Message-ID:

On Mon, 26 Sep 2011, Mihael Hategan wrote:

> On Mon, 2011-09-26 at 14:13 -0500, Justin M Wozniak wrote:
>>>
>>>> string v = "hi";
>>>> string u = "{v}";
>>>>
>>>> results in:
>>>>
>>>> Execution failed:
>>>> Variable not found: v
>>>
>>> The compiler should escape "{" I think.
>>
>> So should we try to repair this or use a different approach? (I imagine
>> users have seen this.)
>>
>
> I think that if we are to properly do this, then the intermediate XML
> should reflect this rather than silently relying on the underlying
> implementation.
>
> Whether we're going to use the same syntax is a different story. But I
> believe that before we do this, we should discuss the following:
> - how are various data types going to be converted to string
> - the exact syntax that we want
> - whether we're going to support format specifiers (for ints and floats)

Oh, I think we should hide the Karajan interpolation from Swift. I think
the dataflow from string interpolation could be problematic. My initial
suggestion is to replace the generated elements with another element that
skips interpolation, if that is in fact where this is coming from. Or we
could try to fix the original intended behavior.
--
Justin M Wozniak

From hategan at mcs.anl.gov Tue Sep 27 00:48:55 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Sep 2011 22:48:55 -0700
Subject: [Swift-devel] Swift input data: @java and string interpolation bug
In-Reply-To:
References: <1317063827.12024.1.camel@blabla> <1317075774.13959.5.camel@blabla>
Message-ID: <1317102535.21797.1.camel@blabla>

On Mon, 2011-09-26 at 17:38 -0500, Justin M Wozniak wrote:
> On Mon, 26 Sep 2011, Mihael Hategan wrote:
> >
> > Whether we're going to use the same syntax is a different story. But I
> > believe that before we do this, we should discuss the following:
> > - how are various data types going to be converted to string
> > - the exact syntax that we want
> > - whether we're going to support format specifiers (for ints and floats)
>
> Oh, I think we should hide the Karajan interpolation from Swift. I think
> the dataflow from string interpolation could be problematic.

Right. That it would.

> My initial
> suggestion is to replace the generated elements with another
> element that skips interpolation, if that is in fact where this is coming
> from. Or we could try to fix the original intended behavior.
>

... or convert things to a @strcat internally?

From hategan at mcs.anl.gov Tue Sep 27 00:51:40 2011
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 26 Sep 2011 22:51:40 -0700
Subject: [Swift-devel] persistent coasters and data staging
In-Reply-To: <1316718730.6443.0.camel@blabla>
References: <1315853779.19354.5.camel@blabla> <1315873153.2945.0.camel@blabla> <1316717825.6012.3.camel@blabla> <1316718730.6443.0.camel@blabla>
Message-ID: <1317102700.21797.3.camel@blabla>

So it might be that the client runs out of buffers and that delays some
transfers which causes the timeouts. I need to confirm this, but a quick
fix may be to disable timeouts for file transfers. The alternative would
be to send some periodic "still queued" message.

I'll give this some thought, but suggestions are welcome.
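The periodic "still queued" alternative can be sketched as a timer that any message resets. The code below is a toy Python model, not the coaster client or worker.pl; the class name, the simulate helper and the event labels are all hypothetical, and time is passed in explicitly so the behaviour is easy to check by hand.

```python
class ActivityTimeout:
    """Deadline that is pushed back by any sign of life from the peer."""

    def __init__(self, timeout_s: float, now: float):
        self.timeout_s = timeout_s
        self.last_activity = now

    def touch(self, now: float) -> None:
        # Any message counts: data chunks and "still queued" keepalives alike.
        self.last_activity = now

    def expired(self, now: float) -> bool:
        return now - self.last_activity > self.timeout_s


def simulate(events, timeout_s, end_time):
    """Replay (time, kind) events sorted by time.

    Returns the time at which the transfer would time out,
    or None if it survives until end_time.
    """
    timer = ActivityTimeout(timeout_s, now=0.0)
    for t, _kind in events:
        if timer.expired(t):
            # The deadline passed before this message arrived.
            return timer.last_activity + timeout_s
        timer.touch(t)
    if timer.expired(end_time):
        return timer.last_activity + timeout_s
    return None
```

In this model a transfer that sits in a queue for 100 seconds with a 30 second timeout fails at t=30, while the same transfer with a "still queued" message every 20 seconds never trips the timer. Disabling timeouts for file transfers, the quick fix mentioned above, would correspond to an infinite timeout_s and gives up detection of a peer that has really stopped.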
Mihael On Thu, 2011-09-22 at 12:12 -0700, Mihael Hategan wrote: > Ah, yes. Sorry. I was looking at the wrong log. > > On Thu, 2011-09-22 at 14:07 -0500, Ketan Maheshwari wrote: > > Mihael, > > > > > > The experiments and logs I sent you above are not from the SCEC > > workflow. These are just the catsn scripts. The logs also doesn't show > > anything related to invalid path as such. > > > > > > The var_str invalid path issue still persists though and I am trying > > to debug it, but that is a completely different one. > > > > > > Regards, > > Ketan > > > > > > On Thu, Sep 22, 2011 at 1:57 PM, Mihael Hategan > > wrote: > > What I see in the log is the error about the invalid path, > > which, as I > > mentioned before, is an issue of var_str seemingly being > > empty (you may > > want to trace its value though to confirm). I don't see > > anything about a > > stagein/out issue. > > > > Mihael > > > > > > On Wed, 2011-09-21 at 16:24 -0500, Ketan Maheshwari wrote: > > > Hi Mihael, > > > > > > > > > I tested this fix. It seems that the timeout issue for > > large-ish data > > > and throttle > ~30 persists. I am not sure if this is data > > staging > > > timeout though. > > > > > > > > > The setup that fails is as follows: > > > > > > > > > persistent coasters, resource= workers running on OSG > > > data size=8MB, 100 data items. > > > foreach throttle=40=jobthrottle. > > > > > > > > > The standard output seems intermittently showing some > > activity and > > > then getting back to no activity without any progress on > > tasks. > > > > > > > > > Please find the log and stdouterr > > > here: > > http://www.ci.uchicago.edu/~ketan/coaster-lab/std.out.err, > > > > > http://www.ci.uchicago.edu/~ketan/coaster-lab/catsn-20110921-1535-v0t3gcg5.log > > > > > > > > > When I tested with small data, 1MB, 2MB, 4MB, it did work. 
> > 4MB > > > displayed a fat tail behavior though, ~94 tasks completing > > steadily > > > and quickly while the last 5-6 tasks taking disproportionate > > times. > > > The throttle in these cases was <= 30. > > > > > > > > > > > > > > > Regards, > > > Ketan > > > > > > On Mon, Sep 12, 2011 at 7:19 PM, Mihael Hategan > > > > > wrote: > > > Try now please (cog r3262). > > > > > > On Mon, 2011-09-12 at 15:56 -0500, Ketan Maheshwari > > wrote: > > > > > > > > > > Mihael, > > > > > > > > > > > > I tried with the new worker.pl, running a 100 task > > 10MB per > > > task run > > > > with throttle set at 100. > > > > > > > > > > > > However, it seems to have failed with the same > > symptoms of > > > timeout > > > > error 521: > > > > > > > > > > > > Caused by: null > > > > Caused by: > > > > > > > > > org.globus.cog.abstraction.impl.common.execution.JobException: > > > Job > > > > failed with an exit code of 521 > > > > Progress: time: Mon, 12 Sep 2011 15:45:31 -0500 > > > Submitted:53 > > > > Active:1 Failed:46 > > > > Progress: time: Mon, 12 Sep 2011 15:45:34 -0500 > > > Submitted:53 > > > > Active:1 Failed:46 > > > > Exception in cat: > > > > Arguments: > > [gpfs/pads/swift/ketan/indir10/data0002.txt] > > > > Host: grid > > > > Directory: > > catsn-20110912-1521-8jh2gar4/jobs/u/cat-u18visfk > > > > - - - > > > > > > > > > > > > Caused by: null > > > > Caused by: > > > > > > > > > org.globus.cog.abstraction.impl.common.execution.JobException: > > > Job > > > > failed with an exit code of 521 > > > > Progress: time: Mon, 12 Sep 2011 15:45:45 -0500 > > > Submitted:52 > > > > Active:1 Failed:47 > > > > Exception in cat: > > > > Arguments: > > [gpfs/pads/swift/ketan/indir10/data0014.txt] > > > > Host: grid > > > > Directory: > > catsn-20110912-1521-8jh2gar4/jobs/x/cat-x18visfk > > > > > > > > > > > > I had about 107 workers running at the time of > > these > > > failures. 
> > > > > > I started seeing the failure messages after about 20 minutes
> > > > > > into this run.
> > > > > >
> > > > > > The logs are in http://www.ci.uchicago.edu/~ketan/pack.tgz
> > > > > >
> > > > > > Regards,
> > > > > > Ketan
> > > > > >
> > > > > > On Mon, Sep 12, 2011 at 1:56 PM, Mihael Hategan wrote:
> > > > > > > On Mon, 2011-09-12 at 11:58 -0500, Ketan Maheshwari wrote:
> > > > > > > > After some discussion with Mike, our conclusion from
> > > > > > > > these runs was that the parallel data transfers are
> > > > > > > > causing timeouts from the worker.pl. Further, we were
> > > > > > > > undecided whether the timeout threshold is somehow set
> > > > > > > > too aggressively, how the timeouts are determined, and
> > > > > > > > whether a change in that value could resolve the issue.
> > > > > > >
> > > > > > > Something like that. Worker.pl would use the time when a
> > > > > > > file transfer started to determine timeouts. This is
> > > > > > > undesirable. The purpose of timeouts is to determine
> > > > > > > whether the other side has stopped from properly following
> > > > > > > the flow of things. It follows that any kind of activity
> > > > > > > should reset the timeout... timer.
> > > > > > >
> > > > > > > I updated the worker code to deal with the issue in a
> > > > > > > proper way. But now I need your help. This is Perl code,
> > > > > > > and it needs testing.
> > > > > > >
> > > > > > > So can you re-run, first with some simple test that uses
> > > > > > > coaster staging (just to make sure I didn't mess something
> > > > > > > up), and then the version of your tests that was most
> > > > > > > likely to fail?
> > > > > > --
> > > > > > Ketan
> > > >
> > > > --
> > > > Ketan
> >
> > --
> > Ketan
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

From davidk at ci.uchicago.edu Tue Sep 27 14:38:25 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Tue, 27 Sep 2011 14:38:25 -0500 (CDT)
Subject: [Swift-devel] Persistent coasters hang
In-Reply-To: <1444355011.122065.1317151395744.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1172691639.122091.1317152305859.JavaMail.root@zimbra-mb2.anl.gov>

Hello,

I've been trying to reproduce the issue with the coaster channel
commands timing out, so I've been adding some new tests to the test
suite. I added a test called tests/stress/persistent-coasters/many-jobs.
It starts the persistent coaster service and then runs catsn 1000 times.
When I tried running it today it froze on the 89th test. I saw there was
a null pointer exception in the log file, and jstack revealed a
deadlock. I am using the latest 0.93. Links to the logs below:

http://www.ci.uchicago.edu/~davidk/stress/jstack.log
http://www.ci.uchicago.edu/~davidk/stress/catsn-140918/catsn-20110927-1409-krck1526.log

Thanks,
David

From davidk at ci.uchicago.edu Wed Sep 28 14:56:53 2011
From: davidk at ci.uchicago.edu (David Kelly)
Date: Wed, 28 Sep 2011 14:56:53 -0500 (CDT)
Subject: [Swift-devel] Test suite modifications
In-Reply-To: <860278908.123942.1317238242686.JavaMail.root@zimbra-mb2.anl.gov>
Message-ID: <1801271411.124001.1317239813990.JavaMail.root@zimbra-mb2.anl.gov>

Hello,

I just wanted to let everyone know about some changes I've made to the
test suite over the last few days.

You can now run a test multiple times by using a scriptname.repeat file.
The file contains a number representing the number of times you would
like the script to repeat.

Each script now runs from its own directory. The directory is
run-YYYY-MM-DD/script-timestamp and includes all input, output, and
configuration files used. This also fixes a potential problem where a
test could incorrectly pass by looking at output files from a previous
script.

The test suite now uses gensites. suite.sh will look first for a
sites.template.xml file in the group directory. If that file exists, it
will now be parsed by gensites rather than a local sed function. You can
also now use a file called gensites.template. The contents of
gensites.template is the name of a gensites template (e.g., "pads",
"local", "ranger"). This will be used by the provider tests.

There is now a test group for stress testing. It's located in
tests/stress. The groupfile is groups/group-stress.sh. I have started
with some tests for persistent coasters to help track down a bug, but
feel free to add others here.

David

From iraicu at cs.iit.edu Fri Sep 30 00:38:05 2011
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Fri, 30 Sep 2011 00:38:05 -0500
Subject: [Swift-devel] Call for Papers and Workshops at HPDC 2012 -- The 21st International ACM Symposium on High-Performance Parallel and Distributed Computing 2012
Message-ID: <4E8555BD.9090804@cs.iit.edu>

**** CALL FOR PAPERS ****
**** CALL FOR WORKSHOP PROPOSALS ****

The 21st International ACM Symposium on
High-Performance Parallel and Distributed Computing (HPDC'12)
Delft University of Technology, Delft, the Netherlands
June 18-22, 2012
http://www.hpdc.org/2012

The ACM International Symposium on High-Performance Parallel and
Distributed Computing (HPDC) is the premier annual conference on the
design, the implementation, the evaluation, and the use of parallel and
distributed systems for high-end computing.
HPDC'12 will take place in Delft, the Netherlands, a historical,
picturesque city that is less than one hour away from Amsterdam-Schiphol
airport. The conference will be held on June 20-22 (Wednesday to
Friday), with affiliated workshops taking place on June 18-19 (Monday
and Tuesday).

**** SUBMISSION DEADLINES ****
Abstracts: 16 January 2012
Papers: 23 January 2012 (No extensions!)

**** HPDC'12 GENERAL CHAIR ****
Dick Epema, Delft University of Technology, Delft, the Netherlands

**** HPDC'12 PROGRAM CO-CHAIRS ****
Thilo Kielmann, Vrije Universiteit, Amsterdam, the Netherlands
Matei Ripeanu, The University of British Columbia, Vancouver, Canada

**** HPDC'12 WORKSHOPS CHAIR ****
Alexandru Iosup, Delft University of Technology, Delft, the Netherlands

**** SCOPE AND TOPICS ****
Submissions are welcomed on all forms of high-performance parallel and
distributed computing, including but not limited to clusters, clouds,
grids, utility computing, data-intensive computing, and massively
multicore systems. Submissions that explore solutions to estimate and
reduce the energy footprint of such systems are particularly encouraged.
All papers will be evaluated for their originality, potential impact,
correctness, quality of presentation, appropriate presentation of
related work, and relevance to the conference, with a strong preference
for rigorous results obtained in operational parallel and distributed
systems.
The topics of interest of the conference include, but are not limited
to, the following, in the context of high-performance parallel and
distributed computing:

- Systems, networks, and architectures for high-end computing
- Massively multicore systems
- Virtualization of machines, networks, and storage
- Programming languages and environments
- I/O, storage systems, and data management
- Resource management, energy and cost minimizations
- Performance modeling and analysis
- Fault tolerance, reliability, and availability
- Data-intensive computing
- Applications of parallel and distributed computing

**** PAPER SUBMISSION GUIDELINES ****
Authors are invited to submit technical papers of at most 12 pages in
PDF format, including figures and references. Papers should be formatted
in the ACM Proceedings Style and submitted via the conference web site.
No changes to the margins, spacing, or font sizes as specified by the
style file are allowed. Accepted papers will appear in the conference
proceedings, and will be incorporated into the ACM Digital Library. A
limited number of papers will be accepted as posters.

Papers must be self-contained and provide the technical substance
required for the program committee to evaluate their contributions.
Submitted papers must be original work that has not appeared in and is
not under consideration for another conference or a journal. See the ACM
Prior Publication Policy for more details.

**** IMPORTANT DATES ****
Workshop Proposals Due: 3 October 2011
Abstracts Due: 16 January 2012
Papers Due: 23 January 2012 (No extensions!)
Reviews Released to Authors: 8 March 2012
Author Rebuttals Due: 12 March 2012
Author Notifications: 19 March 2012
Final Papers Due: 16 April 2012
Conference Dates: 18-22 June 2012

**** CALL FOR WORKSHOP PROPOSALS ****
Workshops affiliated with HPDC will be held on June 18-19 (Monday and
Tuesday).
For more information on the workshops and for the complete Call for
Workshop Proposals, see the workshops page on the conference website
http://www.hpdc.org/2012/workshops/call-for-workshops/.

Workshops should provide forums for discussion among researchers and
practitioners on focused topics or emerging research areas. Organizers
may structure workshops as they see fit, possibly including invited
talks, panel discussions, presentations of work in progress, fully
peer-reviewed papers, or some combination. Workshops could be scheduled
for a half day or a full day, depending on interest, space constraints,
and organizer preference. Organizers should design workshops for
approximately 20-40 participants, to balance impact and effective
discussion.

Workshop proposals must be sent to the HPDC'12 Workshops Chair,
Alexandru Iosup, at a.iosup at tudelft.nl, and should include:

- The name and acronym of the workshop
- A description (0.5-1 page) of the theme of the workshop
- A description (one paragraph) of the relation between the theme of
  the workshop and of HPDC
- A list of topics of interest
- The names and affiliations of the workshop organizers, and if
  applicable, of a significant portion of the program committee
- Data about previous offerings of the workshop (if any), including the
  attendance, the numbers of papers or presentations submitted and
  accepted, and the links to the corresponding websites
- A publicity plan for attracting submissions and attendees

Due to publication deadlines, workshops must operate within roughly the
following timeline: papers due mid February (2-3 weeks after the HPDC
deadline), and selected and sent to the publisher by mid April.

Important dates:
Workshop Proposals Due: 3 October 2011
Notifications: 14 October 2011
Workshop CFPs Online and Distributed: 7 November 2011

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cel: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
=================================================================