[Swift-devel] Re: Swift hanging in complex iterate script

Michael Wilde wilde at mcs.anl.gov
Thu Sep 16 20:09:48 CDT 2010


With r3628, I've run 100 passes of the R tests successfully. Before, it was hanging after fewer than 10 passes.

- Mike


----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:

> Well, so concurrency and all, it's a hairy issue.
> 
> We do seem to synchronize on stuff liberally. I tried to remove what I thought were some unnecessary synchronizations (and I actually had these removed in my local copy a while ago - and I also think they are in the fast branch).
> 
> These are committed in swift r3628.
> 
> But reduced synchronizations may lead to other bad things, and while I tried to avoid that, it is concurrency we're talking about.
> 
> Mihael
> 
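To make the kind of change concrete (a sketch only, with made-up class and method names, not the actual Swift code), the risky shape is calling out to other objects while holding your own monitor; the safer variant snapshots state under the lock and makes the outbound calls with no monitor held:

    import java.util.ArrayList;
    import java.util.List;

    class DataNode {
        interface Listener { void nodeClosed(DataNode n); }

        private final List<Listener> listeners = new ArrayList<Listener>();
        private boolean closed;

        synchronized void addListener(Listener l) { listeners.add(l); }

        // Risky: listeners run while this node's monitor is held. If a
        // listener locks another node that is concurrently locking this
        // one, the two threads deadlock.
        synchronized void closeAndNotifyNested() {
            closed = true;
            for (Listener l : listeners) {
                l.nodeClosed(this);
            }
        }

        // Safer: copy the listener list under the lock, then notify with
        // no monitor held, so no lock-ordering cycle can form.
        void closeAndNotify() {
            List<Listener> snapshot;
            synchronized (this) {
                closed = true;
                snapshot = new ArrayList<Listener>(listeners);
            }
            for (Listener l : snapshot) {
                l.nodeClosed(this);
            }
        }
    }

Whether the synchronizations removed in r3628 have exactly this shape I can't say; it is only meant to illustrate the trade-off described above.
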
> On Thu, 2010-09-16 at 13:26 -0600, wilde at mcs.anl.gov wrote:
> > ~wilde/swiftrhang/rserver.swift
> > 
> > sites.xml, tc, and properties are in the same dir.
> > 
> > Launched swift from ~wilde/SwiftR/Swift/exec/RunRServer.sh:
> > 
> > swift -config swift.properties -tc.file tc -sites.file sites.xml $script \
> >    >& swift.stdouterr
> > 
> > 
> > - Mike
> > 
> > ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
> > 
> > > Can you point me to the swift script?
> > > 
> > > Mihael
> > > 
> > > On Thu, 2010-09-16 at 13:13 -0600, Michael Wilde wrote:
> > > > I see this suspicious deadlock in that jstack output:
> > > > 
> > > > Found one Java-level deadlock:
> > > > =============================
> > > > "pool-1-thread-4":
> > > >   waiting to lock monitor 0x0000000052b88650 (object 0x00002aaab5542348, a org.griphyn.vdl.mapping.RootDataNode),
> > > >   which is held by "pool-1-thread-2"
> > > > "pool-1-thread-2":
> > > >   waiting to lock monitor 0x0000000052b879d8 (object 0x00002aaab5542540, a org.griphyn.vdl.mapping.RootArrayDataNode),
> > > >   which is held by "pool-1-thread-4"
> > > > 
> > > > Java stack information for the threads listed above:
> > > > ===================================================
> > > > ...
> > > > 
> > > > - Mike
> > > > 
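This is the classic lock-ordering deadlock: each thread holds one of a pair of monitors and is waiting for the other. A self-contained illustration (not code from Swift; plain Objects stand in for the RootDataNode and RootArrayDataNode instances named above):

    public class DeadlockDemo {
        // Stand-ins for the two data-node monitors in the jstack report.
        static final Object dataNode = new Object();
        static final Object arrayDataNode = new Object();

        public static void main(String[] args) {
            Thread t4 = new Thread(new Runnable() {
                public void run() {
                    synchronized (arrayDataNode) {       // holds the array node...
                        pause();
                        synchronized (dataNode) { }      // ...then waits for the data node
                    }
                }
            }, "pool-1-thread-4");

            Thread t2 = new Thread(new Runnable() {
                public void run() {
                    synchronized (dataNode) {            // holds the data node...
                        pause();
                        synchronized (arrayDataNode) { } // ...then waits for the array node
                    }
                }
            }, "pool-1-thread-2");

            t4.start();
            t2.start();   // both threads block forever; jstack -l reports the cycle
        }

        static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException e) { }
        }
    }

The usual cures are to stop holding both monitors at once (which is what reducing the synchronizations does) or to always acquire them in a fixed global order.
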
> > > > ----- "Michael Wilde" <wilde at mcs.anl.gov> wrote:
> > > > 
> > > > > OK, that's in ~wilde/swiftrhang/jstack.out
> > > > > 
> > > > > The JVM is still running: pid 22435 on bridled. Let me know if you need any other traces from it.
> > > > > 
> > > > > It hung around 12:05
> > > > > 
> > > > > - Mike
> > > > > 
> > > > > 
> > > > > bri$ mp
> > > > > UID        PID  PPID  PGID   SID  C STIME TTY          TIME CMD
> > > > > wilde    19025 19023 19023 19023  0 11:49 ?        00:00:00 sshd: wilde@pts/29
> > > > > wilde    19026 19025 19026 19026  0 11:49 pts/29   00:00:00   -bash
> > > > > wilde    19203 19026 19203 19026  0 11:49 pts/29   00:00:00     /usr/bin/screen -x
> > > > > wilde    15373 15371 15371 15371  0 06:58 ?        00:00:00 sshd: wilde@pts/14
> > > > > wilde    15374 15373 15374 15374  0 06:58 pts/14   00:00:00   -bash
> > > > > wilde    18291 15374 18291 15374  0 11:47 pts/14   00:00:00     /home/wilde/R/R-2.11.0/bin/exec/R
> > > > > wilde    11315 15374 11315 15374  0 13:59 pts/14   00:00:00     /usr/bin/screen -x
> > > > > wilde    15183 15181 15181 15181  0 06:58 ?        00:00:00 sshd: wilde@pts/11
> > > > > wilde    15184 15183 15184 15184  0 06:58 pts/11   00:00:00   -bash
> > > > > wilde    15835 15184 15835 15184  0 06:59 pts/11   00:00:00     /usr/bin/screen
> > > > > wilde    15836 15835 15836 15836  0 06:59 ?        00:00:00       /usr/bin/SCREEN
> > > > > wilde    15837 15836 15837 15837  0 06:59 pts/15   00:00:00         bash
> > > > > wilde    15839 15836 15839 15839  0 06:59 pts/16   00:00:00         bash
> > > > > wilde    15841 15836 15841 15841  0 06:59 pts/19   00:00:00         bash
> > > > > wilde    15843 15836 15843 15843  0 06:59 pts/20   00:00:00         bash
> > > > > wilde    15845 15836 15845 15845  0 06:59 pts/21   00:00:00         bash
> > > > > wilde    15847 15836 15847 15847  0 06:59 pts/22   00:00:00         bash
> > > > > wilde    14845 15847 14845 15847  0 10:05 pts/22   00:00:00           emacs Swift/exec/start-swift-workers
> > > > > wilde    22364 15847 22364 15847  0 11:59 pts/22   00:00:00           /bin/bash Swift/exec/RunRServer.sh
> > > > > wilde    22378 22364 22364 15847  0 11:59 pts/22   00:00:00             /bin/sh /home/wilde/swift/rev/trunk/bin/swift -config swift.pr
> > > > > wilde    22435 22378 22364 15847  0 11:59 pts/22   00:00:29               java -Xmx256M -Djava.endorsed.dirs=/home/wilde/swift/rev/tru
> > > > > wilde    12506 15847 12506 15847  0 14:03 pts/22   00:00:00           ps -fjH -u wilde
> > > > > wilde    22930     1 22364 15847  0 11:59 pts/22   00:00:05 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
> > > > > wilde    22926     1 22364 15847  0 11:59 pts/22   00:00:04 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
> > > > > wilde    22920     1 22364 15847  0 11:59 pts/22   00:00:04 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
> > > > > wilde    22760     1 22364 15847  0 11:59 pts/22   00:00:04 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
> > > > > bri$
> > > > > 
> > > > > 
> > > > > ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
> > > > > 
> > > > > > I can't tell what's causing the problem, but it may generally be a good idea to do a jstack -l when you get a hang.
> > > > > >
> > > > > > Mihael
> > > > > >
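As an aside (a sketch only, not something the Swift runtime does today), the deadlock report that jstack -l prints can also be obtained programmatically from inside the JVM via ThreadMXBean, so a watchdog thread could log it automatically on a hang:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class DeadlockWatchdog {
        public static void main(String[] args) throws InterruptedException {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            while (true) {
                // Returns null unless some threads are deadlocked on
                // monitors or ownable synchronizers.
                long[] ids = mx.findDeadlockedThreads();
                if (ids != null) {
                    // Dump each deadlocked thread with the locks it holds
                    // and waits on, roughly the deadlock section of jstack -l.
                    for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
                        System.err.println(info);
                    }
                }
                Thread.sleep(10000);   // poll every 10 seconds
            }
        }
    }

It only sees threads in its own JVM, so it would have to run inside the Swift process (for example as a daemon thread) rather than as a separate program.
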
> > > > > > On Thu, 2010-09-16 at 12:14 -0600, wilde at mcs.anl.gov wrote:
> > > > > > > Mihael,
> > > > > > >
> > > > > > > I've developed a Swift script that loops using iterate, reading requests to process an R function from a named pipe (fifo), calling R, and replying "done" on a response fifo.
> > > > > > >
> > > > > > > This has been working very well, but I just hit a case where the script hangs.
> > > > > > >
> > > > > > > I exercise it using a small battery of R tests; I was manually restarting the test battery (which does hundreds of R calls in 30 seconds or so) when it hung in the middle of the test suite.
> > > > > > >
> > > > > > > As far as I can tell, it hung after receiving a work request and mapping the files for the work request, but it never called the app() function that invokes R.
> > > > > > >
> > > > > > > The log is in ~wilde/rserver-20100916-1159-y94hftt0.log
> > > > > > > (I will try to post the script and all related files, but it looks like bridled may have just gone down for patches.)
> > > > > > >
> > > > > > > Look for these trace lines in the log, which are issued at the start of every R request:
> > > > > > >
> > > > > > > line 37342:
> > > > > > >
> > > > > > > 2010-09-16 12:03:24,212-0500 INFO  vdl:execute END_SUCCESS thread=0-1-86-4 tr=bash
> > > > > > > 2010-09-16 12:03:24,213-0500 INFO  apply STARTCOMPOUND thread=0-1-87-2 name=apply
> > > > > > > 2010-09-16 12:03:24,215-0500 WARN  trace SwiftScript trace: rserver: got dir, /autonfs/home/wilde/SwiftR/SwiftR.run.233
> > > > > > > 2010-09-16 12:03:24,215-0500 INFO  SetFieldValue Set: done=false
> > > > > > >
> > > > > > > The END_SUCCESS is the completion of the last app() in the prior iterate pass, which signals the response ("done") fifo using a shell script.
> > > > > > >
> > > > > > > The trace says it's starting to process the next R request, #233 (randomly assigned).
> > > > > > >
> > > > > > > After mapping 20 files (for 5 R datasets containing 2 R evaluation requests each), it just hangs, and all I see in the log after that point is coaster heartbeats.
> > > > > > >
> > > > > > > The last request prior to this hanging request is in the log at line 37137:
> > > > > > >
> > > > > > > 2010-09-16 12:03:24,060-0500 INFO  vdl:execute END_SUCCESS thread=0-1-85-4 tr=bash
> > > > > > > 2010-09-16 12:03:24,062-0500 INFO  apply STARTCOMPOUND thread=0-1-86-2 name=apply
> > > > > > > 2010-09-16 12:03:24,062-0500 WARN  trace SwiftScript trace: rserver: got dir, /autonfs/home/wilde/SwiftR/SwiftR.run.174
> > > > > > > 2010-09-16 12:03:24,062-0500 INFO  SetFieldValue Set: done=false
> > > > > > >
> > > > > > > R request #174 (and all prior ones) completed fine, and should illustrate the normal processing sequence.
> > > > > > >
> > > > > > > Any ideas on what to look for regarding the cause of the hang?
> > > > > > >
> > > > > > > I will try to reproduce it and try to get a karajan status trace using swift stdin.
> > > > > > >
> > > > > > > - Mike
> > > > > > >
> > > > > 
> > > > > --
> > > > > Michael Wilde
> > > > > Computation Institute, University of Chicago
> > > > > Mathematics and Computer Science Division
> > > > > Argonne National Laboratory
> > > >
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



