[Swift-devel] Re: Swift hanging in complex iterate script

Michael Wilde wilde at mcs.anl.gov
Thu Sep 16 14:07:49 CDT 2010


OK, that's in ~wilde/swiftrhang/jstack.out

The jvm is still running: pid 22435 on bridled. Let me know if you need any other traces from it.

It hung around 12:05

- Mike


bri$ mp
UID        PID  PPID  PGID   SID  C STIME TTY          TIME CMD
wilde    19025 19023 19023 19023  0 11:49 ?        00:00:00 sshd: wilde@pts/29
wilde    19026 19025 19026 19026  0 11:49 pts/29   00:00:00   -bash
wilde    19203 19026 19203 19026  0 11:49 pts/29   00:00:00     /usr/bin/screen -x
wilde    15373 15371 15371 15371  0 06:58 ?        00:00:00 sshd: wilde@pts/14
wilde    15374 15373 15374 15374  0 06:58 pts/14   00:00:00   -bash
wilde    18291 15374 18291 15374  0 11:47 pts/14   00:00:00     /home/wilde/R/R-2.11.0/bin/exec/R
wilde    11315 15374 11315 15374  0 13:59 pts/14   00:00:00     /usr/bin/screen -x
wilde    15183 15181 15181 15181  0 06:58 ?        00:00:00 sshd: wilde@pts/11
wilde    15184 15183 15184 15184  0 06:58 pts/11   00:00:00   -bash
wilde    15835 15184 15835 15184  0 06:59 pts/11   00:00:00     /usr/bin/screen
wilde    15836 15835 15836 15836  0 06:59 ?        00:00:00       /usr/bin/SCREEN
wilde    15837 15836 15837 15837  0 06:59 pts/15   00:00:00         bash
wilde    15839 15836 15839 15839  0 06:59 pts/16   00:00:00         bash
wilde    15841 15836 15841 15841  0 06:59 pts/19   00:00:00         bash
wilde    15843 15836 15843 15843  0 06:59 pts/20   00:00:00         bash
wilde    15845 15836 15845 15845  0 06:59 pts/21   00:00:00         bash
wilde    15847 15836 15847 15847  0 06:59 pts/22   00:00:00         bash
wilde    14845 15847 14845 15847  0 10:05 pts/22   00:00:00           emacs Swift/exec/start-swift-workers
wilde    22364 15847 22364 15847  0 11:59 pts/22   00:00:00           /bin/bash Swift/exec/RunRServer.sh
wilde    22378 22364 22364 15847  0 11:59 pts/22   00:00:00             /bin/sh /home/wilde/swift/rev/trunk/bin/swift -config swift.pr
wilde    22435 22378 22364 15847  0 11:59 pts/22   00:00:29               java -Xmx256M -Djava.endorsed.dirs=/home/wilde/swift/rev/tru
wilde    12506 15847 12506 15847  0 14:03 pts/22   00:00:00           ps -fjH -u wilde
wilde    22930     1 22364 15847  0 11:59 pts/22   00:00:05 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
wilde    22926     1 22364 15847  0 11:59 pts/22   00:00:04 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
wilde    22920     1 22364 15847  0 11:59 pts/22   00:00:04 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
wilde    22760     1 22364 15847  0 11:59 pts/22   00:00:04 /home/wilde/R/R-2.11.0/bin/exec/R --slave --no-restore --file=./SwiftRServ
bri$ 
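
For anyone repeating this on another hang, here is a minimal sketch of capturing the same kind of dump from a script. It assumes a JDK's jstack is on PATH and that the caller runs as the user who owns the JVM; the function names and the output file name are hypothetical, not part of Swift.

```python
import subprocess
from pathlib import Path


def jstack_command(pid):
    """Build the argument list for a thread dump that includes lock details (-l)."""
    return ["jstack", "-l", str(int(pid))]


def capture_thread_dump(pid, out_path):
    """Run jstack against a live JVM and save its standard output to out_path.

    Assumes jstack is on PATH; raises CalledProcessError if jstack fails.
    """
    result = subprocess.run(
        jstack_command(pid), capture_output=True, text=True, check=True
    )
    Path(out_path).write_text(result.stdout)
    return out_path


# Example call (not run here): 22435 is the JVM pid from the listing above.
# capture_thread_dump(22435, "jstack.out")
```

Taking two or three dumps a few seconds apart may make it easier to tell a real deadlock from a thread that is merely slow.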


----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:

> I can't tell what's causing the problem, but it may generally be a good
> idea to do a jstack -l when you get a hang.
> 
> Mihael
> 
> On Thu, 2010-09-16 at 12:14 -0600, wilde at mcs.anl.gov wrote:
> > Mihael,
> > 
> > I've developed a Swift script that loops using iterate, reading requests to process an R function from a named pipe (fifo), calling R, and replying "done" on a response fifo.
> > 
> > This has been working very well, but I just hit a case where the script hangs.
> > 
> > I exercise it using a small battery of R tests; I was manually restarting the test battery (which does hundreds of R calls in 30 seconds or so) when it hung in the middle of the test suite.
> > 
> > As far as I can tell, it hung after receiving a work request and mapping the files for it, but before ever calling the app() function that invokes R.
> > 
> > The log is in ~wilde/rserver-20100916-1159-y94hftt0.log
> > (I will try to post the script and all related files, but it looks like bridled may have just gone down for patches.)
> > 
> > Look for these trace lines in the log, which are issued at the start of every R request:
> > 
> > line 37342:
> > 
> > 2010-09-16 12:03:24,212-0500 INFO  vdl:execute END_SUCCESS thread=0-1-86-4 tr=bash
> > 2010-09-16 12:03:24,213-0500 INFO  apply STARTCOMPOUND thread=0-1-87-2 name=apply
> > 2010-09-16 12:03:24,215-0500 WARN  trace SwiftScript trace: rserver: got dir, /autonfs/home/wilde/SwiftR/SwiftR.run.233
> > 2010-09-16 12:03:24,215-0500 INFO  SetFieldValue Set: done=false
> > 
> > The END_SUCCESS is the completion of the last app() in the prior iterate pass, which signals the response ("done") fifo using a shell script.
> > 
> > The trace says it's starting to process the next R request, #233 (randomly assigned).
> > 
> > After mapping 20 files (for 5 R datasets containing 2 R evaluation requests each), it just hangs, and all I see in the log after that point is coaster heartbeats.
> > 
> > The last request prior to this hanging request is in the log at line 37137:
> > 
> > 2010-09-16 12:03:24,060-0500 INFO  vdl:execute END_SUCCESS thread=0-1-85-4 tr=bash
> > 2010-09-16 12:03:24,062-0500 INFO  apply STARTCOMPOUND thread=0-1-86-2 name=apply
> > 2010-09-16 12:03:24,062-0500 WARN  trace SwiftScript trace: rserver: got dir, /autonfs/home/wilde/SwiftR/SwiftR.run.174
> > 2010-09-16 12:03:24,062-0500 INFO  SetFieldValue Set: done=false
> > 
> > R request #174 (and all prior ones) completed fine, and should illustrate the normal processing sequence.
> > 
> > Any ideas on what to look for regarding the cause of the hang?
> > 
> > I will try to reproduce it and get a karajan status trace using swift stdin.
> > 
> > - Mike
> >
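
The log pattern described in the quoted message (a "got dir" trace that is never followed by an app() execution) can be searched for mechanically. Below is a rough sketch, assuming each log event sits on one line and that any line containing "vdl:execute" after a "got dir" trace means the request reached app(). The function name is hypothetical and the matching rule is a guess based only on the lines quoted above.

```python
import re

# Matches the trace line issued at the start of every R request.
GOT_DIR = re.compile(r"SwiftScript trace: rserver:\s*got dir, (\S+)")


def requests_without_execute(lines):
    """Return request dirs whose 'got dir' trace has no later vdl:execute event.

    A request counts as started once any vdl:execute line appears after its
    'got dir' trace and before the next 'got dir' trace (an assumption).
    """
    stalled = []
    current = None  # request dir still waiting for an execute event
    for line in lines:
        match = GOT_DIR.search(line)
        if match:
            if current is not None:
                stalled.append(current)
            current = match.group(1)
        elif "vdl:execute" in line:
            current = None
    if current is not None:
        stalled.append(current)
    return stalled


# Example call (not run here), using the log file named in the message:
# with open("rserver-20100916-1159-y94hftt0.log") as log:
#     print(requests_without_execute(log))
```

On a log like the one described, this should report only the last request directory, which narrows down where to start reading.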

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory