[Swift-devel] swift 0.93 deadlock

Mihael Hategan hategan at mcs.anl.gov
Sat Sep 17 23:36:25 CDT 2011


I have a tentative fix in the branch and trunk (revisions 5123 and 5124,
respectively). Please let me know how that works out.

Mihael

On Fri, 2011-09-16 at 11:50 -0500, Michael Wilde wrote:
> David and Papia, can you report to the list what the status is of running the SWAT app?
> 
> - I understand that Mihael will work on the 0.93 deadlock fix this weekend, which is great.
> 
> - I understand that it's happening on trunk as well
> 
> - Papia, can you try to "perturb" the Swift code in the hope that some equivalent but different code doesn't trip the same bug? I.e., try a different mapper or a different variable strategy (arrays vs. scalars, structs vs. separate vars) just to see if you can work around this. Or, put in some shell logic to catch the hang, then kill and re-run (or resume) Swift? If you just kill a hung script and then resume it, will it work? We could maybe alter the hang checker to kill Swift on its own, with a return code or message that you could use to trigger a resume.
> 
> Mike
> 
> 
> ----- Original Message -----
> > From: "David Kelly" <davidk at ci.uchicago.edu>
> > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>, "Papia Rizwan" <papia.rizwan at gmail.com>
> > Sent: Thursday, September 15, 2011 4:34:02 PM
> > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > I was able to get it running on PADS with trunk. I ran into the same
> > issue.
> > 
> > http://www.ci.uchicago.edu/~davidk/swat3/jstack.log
> > http://www.ci.uchicago.edu/~davidk/swat3/cce_ua-20110915-1617-sd4svyo2.log
> > 
> > David
> > 
> > ----- Original Message -----
> > > From: "David Kelly" <davidk at ci.uchicago.edu>
> > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>, "Papia
> > > Rizwan" <papia.rizwan at gmail.com>
> > > Sent: Thursday, September 15, 2011 2:39:47 PM
> > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > The sites.xml in /homes/papia/SwiftSCE2 seems to be using passive
> > > > persistent coasters. Is there a way to use automatic coasters on the
> > > > MCS workstations? I'll try copying this over to PADS and running
> > > > there to see if I can reproduce it.
> > >
> > > David
> > >
> > > ----- Original Message -----
> > > > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > To: "David Kelly" <davidk at ci.uchicago.edu>
> > > > Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>, "Papia
> > > > Rizwan" <papia.rizwan at gmail.com>, "Mihael Hategan"
> > > > <hategan at mcs.anl.gov>
> > > > Sent: Thursday, September 15, 2011 2:18:17 PM
> > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > Can you make SWAT run under trunk? Papia is testing using standard
> > > > > auto coasters and doesn't need any of the missing coaster-service
> > > > > options.
> > > >
> > > > - Mike
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > From: "David Kelly" <davidk at ci.uchicago.edu>
> > > > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>, "Papia
> > > > > Rizwan" <papia.rizwan at gmail.com>, "Mihael Hategan"
> > > > > <hategan at mcs.anl.gov>
> > > > > Sent: Thursday, September 15, 2011 2:15:36 PM
> > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > I got past the compilation errors by renaming all the functions
> > > > > > with capitalization, but ran into an issue with coaster-service.
> > > > > > Last week I noticed coaster-service was missing options for dynamic
> > > > > > ports. I found today that it is also missing -passive. I'll try to
> > > > > > track down where this changed and restore the previous version.
> > > > >
> > > > > David
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > > To: "David Kelly" <davidk at ci.uchicago.edu>
> > > > > > Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>, "Papia
> > > > > > Rizwan" <papia.rizwan at gmail.com>, "Mihael Hategan"
> > > > > > <hategan at mcs.anl.gov>
> > > > > > Sent: Thursday, September 15, 2011 12:37:13 PM
> > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > Excellent, thanks - that's good. I also just verified that Papia is
> > > > > > > not using the overAllocation tags in the sites file, so this problem
> > > > > > > is clearly a Java deadlock and has nothing to do with the scheduling
> > > > > > > problem that the (now fixed) overAllocation problem was causing.
> > > > > >
> > > > > > > My understanding is that this SWAT script is failing under trunk
> > > > > > > because of the recent token case handling issue (I think the
> > > > > > > camel-case one). Can you work with Papia to see if either that issue
> > > > > > > is now fixed, or if her script can be changed to avoid that, so that
> > > > > > > you can both test the SWAT script with trunk, to see if the deadlock
> > > > > > > still occurs?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > > - Mike
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "David Kelly" <davidk at ci.uchicago.edu>
> > > > > > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > > > Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>,
> > > > > > > "Papia
> > > > > > > Rizwan" <papia.rizwan at gmail.com>, "Mihael Hategan"
> > > > > > > <hategan at mcs.anl.gov>
> > > > > > > Sent: Thursday, September 15, 2011 12:29:03 PM
> > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > I narrowed down the problem a bit. Last night I ran jstack on the
> > > > > > > > wrong java process, which is why it didn't report a deadlock.
> > > > > > >
> > > > > > > Papia and I are seeing the same issue.
> > > > > > >
> > > > > > > My jstack:
> > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/jstack.log
> > > > > > > Papia's jstack:
> > > > > > > http://www.ci.uchicago.edu/~davidk/swat2/papia-jstack.log
> > > > > > >
> > > > > > > It happens in the same place:
> > > > > > >
> > > > > > > org.griphyn.vdl.karajan.lib.cache.File.lock(File.java:100)
> > > > > > > org.griphyn.vdl.karajan.lib.cache.LRUFileCache.addAndLockEntry(LRUFileCache.java:24)
> > > > > > >
> > > > > > > Filed as bug #559
> > > > > > >
> > > > > > > David
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > > > > To: "David Kelly" <davidk at ci.uchicago.edu>
> > > > > > > > Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>,
> > > > > > > > "Papia
> > > > > > > > Rizwan" <papia.rizwan at gmail.com>, "Mihael Hategan"
> > > > > > > > <hategan at mcs.anl.gov>
> > > > > > > > Sent: Thursday, September 15, 2011 11:46:59 AM
> > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > David, it sounds like more analysis is needed here. If the SWAT
> > > > > > > > > runs are not showing a deadlock (but your runs are) then likely we
> > > > > > > > > have two different problems here.
> > > > > > > >
> > > > > > > > > Another case we saw in 0.93 with scripts failing to progress is due
> > > > > > > > > to the overAllocation parameter problem that Mihael fixed yesterday.
> > > > > > > > > The symptom there is that Swift starts a coaster with a time slot
> > > > > > > > > too small for the apps in the script, and no apps wind up running.
> > > > > > > > > I think that situation in general merits a separate ticket, and may
> > > > > > > > > have been discussed on swift-devel (but quite a while ago).
> > > > > > > >
> > > > > > > > > Can you determine if indeed Papia's SWAT runs are hanging for a
> > > > > > > > > reason other than a Java deadlock?
> > > > > > > >
> > > > > > > > - Mike
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > > From: "David Kelly" <davidk at ci.uchicago.edu>
> > > > > > > > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > > > > > Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>,
> > > > > > > > > "Papia
> > > > > > > > > Rizwan" <papia.rizwan at gmail.com>, "Mihael Hategan"
> > > > > > > > > <hategan at mcs.anl.gov>
> > > > > > > > > Sent: Thursday, September 15, 2011 8:03:09 AM
> > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > The jstack log corresponds to the most recent log file -
> > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/cce_ua-20110914-1934-frd3thja.log.
> > > > > > > > > > jstack does not report any deadlocks, but I thought it might be
> > > > > > > > > > useful so I included it. Swift was not making any progress for
> > > > > > > > > > about 5 hours before I sent the logs. I am running the latest 0.93
> > > > > > > > > > branch. I will try again today.
> > > > > > > > >
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > > > > > > To: "David Kelly" <davidk at ci.uchicago.edu>
> > > > > > > > > > Cc: "swift-devel Devel" <swift-devel at ci.uchicago.edu>,
> > > > > > > > > > "Papia
> > > > > > > > > > Rizwan" <papia.rizwan at gmail.com>, "Mihael Hategan"
> > > > > > > > > > <hategan at mcs.anl.gov>
> > > > > > > > > > Sent: Thursday, September 15, 2011 5:54:11 AM
> > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > David, which of the many Swift logs in that /swat dir does the
> > > > > > > > > > > jstack.log pertain to? How many of these runs deadlocked?
> > > > > > > > > >
> > > > > > > > > > > And, did you verify that you (and Papia) are running on the latest
> > > > > > > > > > > rev of the 0.93 branch?
> > > > > > > > > >
> > > > > > > > > > - Mike
> > > > > > > > > >
> > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: "David Kelly" <davidk at ci.uchicago.edu>
> > > > > > > > > > > To: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > > > > > > > > > Cc: "swift-devel Devel"
> > > > > > > > > > > <swift-devel at ci.uchicago.edu>,
> > > > > > > > > > > "Papia
> > > > > > > > > > > Rizwan" <papia.rizwan at gmail.com>, "Michael Wilde"
> > > > > > > > > > > <wilde at mcs.anl.gov>
> > > > > > > > > > > Sent: Wednesday, September 14, 2011 11:04:41 PM
> > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > I was able to reproduce the problem with persistent coasters on
> > > > > > > > > > > > the MCS servers.
> > > > > > > > > > >
> > > > > > > > > > > The jstack output is at
> > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat/jstack.log
> > > > > > > > > > >
> > > > > > > > > > > > The full collection of logs is at
> > > > > > > > > > > > http://www.ci.uchicago.edu/~davidk/swat.
> > > > > > > > > > >
> > > > > > > > > > > David
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > From: "Mihael Hategan" <hategan at mcs.anl.gov>
> > > > > > > > > > > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > > > > > > > > Cc: "swift-devel Devel"
> > > > > > > > > > > > <swift-devel at ci.uchicago.edu>,
> > > > > > > > > > > > "Papia
> > > > > > > > > > > > Rizwan" <papia.rizwan at gmail.com>
> > > > > > > > > > > > Sent: Wednesday, September 14, 2011 10:30:48 PM
> > > > > > > > > > > > Subject: Re: [Swift-devel] swift 0.93 deadlock
> > > > > > > > > > > > Could you also forward the attachments please?
> > > > > > > > > > > >
> > > > > > > > > > > > Mihael
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, 2011-09-14 at 14:46 -0500, Michael Wilde
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > I think I am seeing a similar deadlock on 0.93 in the ParVis
> > > > > > > > > > > > > > script, and am trying to get a clean log and jstack to confirm.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > As far as I can tell, Papia is running the correct 0.93 code,
> > > > > > > > > > > > > > but please verify.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > David will try to replicate this problem as well.
> > > > > > > > > > > > >
> > > > > > > > > > > > > - Mike
> > > > > > > > > > > > >
> > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > > From: "Papia Rizwan" <papia.rizwan at gmail.com>
> > > > > > > > > > > > > > To: "swift-devel Devel"
> > > > > > > > > > > > > > <swift-devel at ci.uchicago.edu>,
> > > > > > > > > > > > > > "Michael
> > > > > > > > > > > > > > Wilde" <wilde at mcs.anl.gov>, "Michael P.
> > > > > > > > > > > > > > Shields"
> > > > > > > > > > > > > > <mpshields at anl.gov>
> > > > > > > > > > > > > > Sent: Wednesday, September 14, 2011 1:56:13 PM
> > > > > > > > > > > > > > Subject: swift 0.93 deadlock
> > > > > > > > > > > > > > > Attached are the jstack output and the log file.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Papia Rizwan
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > _______________________________________________
> > > > > > > > > > > > Swift-devel mailing list
> > > > > > > > > > > > Swift-devel at ci.uchicago.edu
> > > > > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Michael Wilde
> > > > > > > > > > Computation Institute, University of Chicago
> > > > > > > > > > Mathematics and Computer Science Division
> > > > > > > > > > Argonne National Laboratory
> > > > > > > >
> > > > > > > > --
> > > > > > > > Michael Wilde
> > > > > > > > Computation Institute, University of Chicago
> > > > > > > > Mathematics and Computer Science Division
> > > > > > > > Argonne National Laboratory
> > > > > >
> > > > > > --
> > > > > > Michael Wilde
> > > > > > Computation Institute, University of Chicago
> > > > > > Mathematics and Computer Science Division
> > > > > > Argonne National Laboratory
> > > >
> > > > --
> > > > Michael Wilde
> > > > Computation Institute, University of Chicago
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> 
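
The kill-and-resume workaround suggested in the quoted thread (catch the
hang, kill Swift, then re-run or resume) could be sketched roughly as
below. This is only an illustration under stated assumptions, not the
hang checker built into Swift: it assumes "no progress" can be detected
by a log file that stops growing, and the function names, the log path,
and both command lines are hypothetical placeholders supplied by the
caller.

```python
import subprocess
import time
from pathlib import Path


def run_with_hang_check(cmd, log_path, stall_seconds, poll_seconds=1.0):
    """Run cmd and kill it if log_path stops growing for stall_seconds.

    Returns ("done", returncode) if the process exited on its own, or
    ("hung", None) if it was killed for making no progress.
    Assumption: a log file that stops growing means the run is hung.
    """
    log = Path(log_path)
    proc = subprocess.Popen(cmd)
    last_size = -1
    last_progress = time.monotonic()
    while True:
        code = proc.poll()
        if code is not None:
            return ("done", code)
        size = log.stat().st_size if log.exists() else 0
        if size != last_size:
            # The log changed size, so count that as progress.
            last_size = size
            last_progress = time.monotonic()
        elif time.monotonic() - last_progress >= stall_seconds:
            # No growth for too long: treat the run as hung and kill it.
            proc.kill()
            proc.wait()
            return ("hung", None)
        time.sleep(poll_seconds)


def run_until_done(first_cmd, resume_cmd, log_path, stall_seconds,
                   max_attempts=3, poll_seconds=1.0):
    """Start with first_cmd; after each hang, retry with resume_cmd.

    Returns (attempts_used, returncode); returncode is None if every
    attempt had to be killed.
    """
    cmd = first_cmd
    for attempt in range(1, max_attempts + 1):
        status, code = run_with_hang_check(
            cmd, log_path, stall_seconds, poll_seconds)
        if status == "done":
            return attempt, code
        cmd = resume_cmd  # hypothetical resume command line
    return max_attempts, None
```

Whether a killed run actually resumes cleanly is the open question
raised in the thread; the sketch only automates the kill and retry.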
