[Swift-devel] Re: Swift hanging in complex iterate script
Michael Wilde
wilde at mcs.anl.gov
Thu Sep 16 14:13:37 CDT 2010
I see this suspicious deadlock in that jstack output:
Found one Java-level deadlock:
=============================
"pool-1-thread-4":
waiting to lock monitor 0x0000000052b88650 (object 0x00002aaab5542348, a org.griphyn.vdl.mapping.RootDataNode),
which is held by "pool-1-thread-2"
"pool-1-thread-2":
waiting to lock monitor 0x0000000052b879d8 (object 0x00002aaab5542540, a org.griphyn.vdl.mapping.RootArrayDataNode),
which is held by "pool-1-thread-4"
Java stack information for the threads listed above:
===================================================
...
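
For anyone who wants to see what that report describes, here is a minimal sketch of the same lock-ordering pattern. It is not Swift's actual code: the class name, thread names, and the two plain Object locks standing in for RootDataNode and RootArrayDataNode are all made up for illustration. Two threads take the same two monitors in opposite order, and the standard ThreadMXBean API then reports the cycle much as jstack does.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class DeadlockSketch {

    // Starts a daemon thread that locks `first`, waits until both threads
    // hold their first lock, and then tries to lock `second`.
    static void locker(String name, Object first, Object second, CountDownLatch bothHold) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHold.countDown();
                try {
                    bothHold.await();
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // Never reached: the other thread holds `second` and waits for `first`.
                }
            }
        }, name);
        t.setDaemon(true); // daemon threads let the JVM exit despite the deadlock
        t.start();
    }

    // Builds the deadlock and returns how many threads the JVM reports as deadlocked.
    static int countDeadlockedThreads() {
        Object rootNode = new Object();   // hypothetical stand-in for RootDataNode
        Object arrayNode = new Object();  // hypothetical stand-in for RootArrayDataNode
        CountDownLatch bothHold = new CountDownLatch(2);
        locker("demo-thread-2", rootNode, arrayNode, bothHold);
        locker("demo-thread-4", arrayNode, rootNode, bothHold);

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 100; i++) {
            long[] ids = mx.findMonitorDeadlockedThreads(); // null when no cycle is found
            if (ids != null) {
                for (ThreadInfo info : mx.getThreadInfo(ids)) {
                    System.out.println(info.getThreadName() + " waits for "
                            + info.getLockName() + " held by " + info.getLockOwnerName());
                }
                return ids.length;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + countDeadlockedThreads());
    }
}
```

In a sketch like this the cycle goes away if both threads take the two locks in the same order; whether that is the right fix for the mapping classes here is a separate question.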
- Mike
----- "Michael Wilde" <wilde at mcs.anl.gov> wrote:
> OK, that's in ~wilde/swiftrhang/jstack.out
>
> The jvm is still running: pid 22435 on bridled. Let me know if you
> need any other traces from it.
>
> It hung around 12:05
>
> - Mike
>
>
> bri$ mp
> UID     PID  PPID  PGID   SID C STIME TTY    TIME     CMD
> wilde 19025 19023 19023 19023 0 11:49 ?      00:00:00 sshd: wilde@pts/29
> wilde 19026 19025 19026 19026 0 11:49 pts/29 00:00:00 -bash
> wilde 19203 19026 19203 19026 0 11:49 pts/29 00:00:00 /usr/bin/screen -x
> wilde 15373 15371 15371 15371 0 06:58 ?      00:00:00 sshd: wilde@pts/14
> wilde 15374 15373 15374 15374 0 06:58 pts/14 00:00:00 -bash
> wilde 18291 15374 18291 15374 0 11:47 pts/14 00:00:00 /home/wilde/R/R-2.11.0/bin/exec/R
> wilde 11315 15374 11315 15374 0 13:59 pts/14 00:00:00 /usr/bin/screen -x
> wilde 15183 15181 15181 15181 0 06:58 ?      00:00:00 sshd: wilde@pts/11
> wilde 15184 15183 15184 15184 0 06:58 pts/11 00:00:00 -bash
> wilde 15835 15184 15835 15184 0 06:59 pts/11 00:00:00 /usr/bin/screen
> wilde 15836 15835 15836 15836 0 06:59 ?      00:00:00 /usr/bin/SCREEN
> wilde 15837 15836 15837 15837 0 06:59 pts/15 00:00:00 bash
> wilde 15839 15836 15839 15839 0 06:59 pts/16 00:00:00 bash
> wilde 15841 15836 15841 15841 0 06:59 pts/19 00:00:00 bash
> wilde 15843 15836 15843 15843 0 06:59 pts/20 00:00:00 bash
> wilde 15845 15836 15845 15845 0 06:59 pts/21 00:00:00 bash
> wilde 15847 15836 15847 15847 0 06:59 pts/22 00:00:00 bash
> wilde 14845 15847 14845 15847 0 10:05 pts/22 00:00:00 emacs Swift/exec/start-swift-workers
> wilde 22364 15847 22364 15847 0 11:59 pts/22 00:00:00 /bin/bash Swift/exec/RunRServer.sh
> wilde 22378 22364 22364 15847 0 11:59 pts/22 00:00:00 /bin/sh /home/wilde/swift/rev/trunk/bin/swift -config swift.pr
> wilde 22435 22378 22364 15847 0 11:59 pts/22 00:00:29 java -Xmx256M -Djava.endorsed.dirs=/home/wilde/swift/rev/tru
> wilde 12506 15847 12506 15847 0 14:03 pts/22 00:00:00 ps -fjH -u wilde
> wilde 22930     1 22364 15847 0 11:59 pts/22 00:00:05 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
> wilde 22926     1 22364 15847 0 11:59 pts/22 00:00:04 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
> wilde 22920     1 22364 15847 0 11:59 pts/22 00:00:04 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
> wilde 22760     1 22364 15847 0 11:59 pts/22 00:00:04 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
> bri$
>
>
> ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
>
> > I can't tell what's causing the problem, but it may generally be a
> > good idea to do a jstack -l when you get a hang.
> >
> > Mihael
> >
> > On Thu, 2010-09-16 at 12:14 -0600, wilde at mcs.anl.gov wrote:
> > > Mihael,
> > >
> > > I've developed a Swift script that loops using iterate, reading
> > > requests to process an R function from a named pipe (fifo), calling
> > > R, and replying "done" on a response fifo.
> > >
> > > This has been working very well, but I just hit a case where the
> > > script hangs.
> > >
> > > I exercise it using a small battery of R tests; I was manually
> > > restarting the test battery (which does hundreds of R calls in 30
> > > seconds or so) when it hung in the middle of the test suite.
> > >
> > > As far as I can tell, it hung after receiving a work request and
> > > mapping the files for that request, but before calling the app()
> > > function that invokes R.
> > >
> > > The log is in ~wilde/rserver-20100916-1159-y94hftt0.log
> > > (I will try to post the script and all related files, but it looks
> > > like bridled may have just gone down for patches)
> > >
> > > Look for these trace lines in the log, which are issued at the
> > > start of every R request:
> > >
> > > line 37342:
> > >
> > > 2010-09-16 12:03:24,212-0500 INFO vdl:execute END_SUCCESS thread=0-1-86-4 tr=bash
> > > 2010-09-16 12:03:24,213-0500 INFO apply STARTCOMPOUND thread=0-1-87-2 name=apply
> > > 2010-09-16 12:03:24,215-0500 WARN trace SwiftScript trace: rserver: got dir, /autonfs/home/wilde/SwiftR/SwiftR.run.233
> > > 2010-09-16 12:03:24,215-0500 INFO SetFieldValue Set: done=false
> > >
> > > The END_SUCCESS is the completion of the last app() in the prior
> > > iterate pass, which signals the response ("done") fifo using a
> > > shell script.
> > >
> > > The trace says it's starting to process the next R request, #233
> > > (randomly assigned).
> > >
> > > After mapping 20 files (for 5 R datasets containing 2 R evaluation
> > > requests each), it just hangs, and all I see in the log after that
> > > point is coaster heartbeats.
> > >
> > > The last request prior to this hanging request is in the log at
> > > line 37137:
> > >
> > > 2010-09-16 12:03:24,060-0500 INFO vdl:execute END_SUCCESS thread=0-1-85-4 tr=bash
> > > 2010-09-16 12:03:24,062-0500 INFO apply STARTCOMPOUND thread=0-1-86-2 name=apply
> > > 2010-09-16 12:03:24,062-0500 WARN trace SwiftScript trace: rserver: got dir, /autonfs/home/wilde/SwiftR/SwiftR.run.174
> > > 2010-09-16 12:03:24,062-0500 INFO SetFieldValue Set: done=false
> > >
> > > R request #174 (and all prior ones) completed fine, and should
> > > illustrate the normal processing sequence.
> > >
> > > Any ideas on what to look for regarding the cause of the hang?
> > >
> > > I will try to reproduce it and try to get a karajan status trace
> > > using swift stdin.
> > >
> > > - Mike
> > >
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory