[Swift-devel] Persistent coaster service fails after several runs
Michael Wilde
wilde at mcs.anl.gov
Sun Nov 21 21:00:04 CST 2010
subject was: Re: [Swift-devel] misassignment of jobs
Re the service-side timeout, OK, will do.
I've just re-created bug 1, but it's a little different than I thought.
Swift runs against the persistent coaster server lock up (i.e. fail to progress) and then produce errors, not after a delay but seemingly at random. That's likely why I was misled into thinking it was delay-related.
I started a coaster server on localhost with one worker.pl.
I then ran catsn.swift against it with various values of n (the number of cat jobs), including 1, 10, and 100.
The first several (5-10) Swift runs worked fine. Then I let it sit idle for 5 minutes and tried again. That too worked fine. But then, after a few more runs, things hung.
Here are all the logs and details if you want to look into this particular run.
I'm working in /home/wilde/swift/lab, on PADS login1.
The latest .log in this listing is the failing case; the others worked (against the same persistent server):
login1$ ls -lt *.log | head -20
-rw-r--r-- 1 wilde ci-users 95478 Nov 21 20:41 catsn-20101121-2039-1yfngygc.log
-rw-r--r-- 1 wilde ci-users 36085 Nov 21 20:39 swift.log
-rw-r--r-- 1 wilde ci-users 272734 Nov 21 20:37 catsn-20101121-2037-7uk5fj33.log
-rw-r--r-- 1 wilde ci-users 272644 Nov 21 20:37 catsn-20101121-2037-j8xq9aie.log
-rw-r--r-- 1 wilde ci-users 272468 Nov 21 20:36 catsn-20101121-2036-4y0tnimd.log
-rw-r--r-- 1 wilde ci-users 31317 Nov 21 20:36 catsn-20101121-2036-opcvomk4.log
-rw-r--r-- 1 wilde ci-users 7183 Nov 21 20:36 catsn-20101121-2036-u59brtm4.log
-rw-r--r-- 1 wilde ci-users 7183 Nov 21 20:35 catsn-20101121-2035-360kh03b.log
-rw-r--r-- 1 wilde ci-users 7351 Nov 21 20:35 catsn-20101121-2035-8lttnn88.log
-rw-r--r-- 1 wilde ci-users 7183 Nov 21 20:30 catsn-20101121-2030-ddmo6gt3.log
-rw-r--r-- 1 wilde ci-users 7267 Nov 21 20:29 catsn-20101121-2029-sq8y6cnb.log
-rw-r--r-- 1 wilde ci-users 7179 Nov 21 20:29 catsn-20101121-2029-3su2x8v9.log
-rw-r--r-- 1 wilde ci-users 7183 Nov 21 20:29 catsn-20101121-2029-z0g50i50.log
-rw-r--r-- 1 wilde ci-users 7267 Nov 21 20:29 catsn-20101121-2029-5x6pbkde.log
The worker and service logs are in: /tmp/wilde/Swift/{server,worker}
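If it helps to line them up with the failing run, the newest files under those two directories should correspond to it; something like the following (plain shell, and the file layout under those directories is my assumption) pulls up the most recent ones:
login1$ ls -lt /tmp/wilde/Swift/server | head -5
login1$ ls -lt /tmp/wilde/Swift/worker | head -5
login1$ tail -n 50 /tmp/wilde/Swift/server/$(ls -t /tmp/wilde/Swift/server | head -1)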
The swift being used is:
/scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift
The test runs were all of this form, with various n as above:
login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
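(If it's useful, the reproduction pattern described above can be scripted roughly as follows; the script itself is hypothetical, and the iteration counts, the 5-minute sleep, and n=100 are just the values from this session:)
# burst of runs, an idle gap, then more runs until one of them hangs
for i in 1 2 3 4 5; do
  swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
done
sleep 300   # let the service/worker sit idle for ~5 minutes
for i in 1 2 3 4 5; do
  swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
done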
I started the persistent coaster service with the somewhat ugly script:
/home/wilde/swift/lab/pecos/start-mcs
(which runs a dummy job to force the server into passive mode, to handle the general case of workers joining and leaving the service).
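The gist of it is roughly the following. This is only a sketch, not the actual script: the service-startup step itself is omitted, the worker.pl argument order (service URI, block ID, log dir, optional idle timeout in seconds) is taken from the worker.pl snippet quoted further down, and the block ID and timeout values here are arbitrary.
# start the persistent coaster service on port 1985 -- omitted here;
# this is whatever start-mcs actually does to launch it

# start one worker pointed at the service, with a long idle timeout
# (4th argument, in seconds) so it doesn't exit between Swift runs;
# "0001" is an arbitrary block ID and the log dir is the one noted above
perl worker.pl http://localhost:1985 0001 /tmp/wilde/Swift/worker 86400 &

# run one dummy job through the service to force it into passive mode,
# so workers can join and leave independently of any single Swift run
swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=1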
I'll clean this up for reproducibility if you can't spot the issue from these logs.
Lastly, the last few runs, including the failing one, produced the following on stdout/stderr:
login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
Swift svn swift-r3707 (swift modified locally) cog-r2932 (cog modified locally)
RunID: 20101121-2037-j8xq9aie
Progress:
Find: http://localhost:1985
Find: keepalive(120), reconnect - http://localhost:1985
Progress: Selecting site:64 Submitting:3 Submitted:25 Active:4 Finished successfully:4
Progress: Selecting site:52 Submitted:28 Active:3 Checking status:1 Finished successfully:16
Progress: Selecting site:36 Submitting:3 Submitted:25 Active:4 Finished successfully:32
Progress: Selecting site:23 Submitted:28 Active:3 Checking status:1 Finished successfully:45
Progress: Selecting site:7 Submitted:27 Active:3 Checking status:1 Finished successfully:62
Progress: Submitted:14 Active:2 Stage out:3 Finished successfully:81
Progress: Submitted:3 Stage out:3 Finished successfully:94
Final status: Finished successfully:100
login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
Swift svn swift-r3707 (swift modified locally) cog-r2932 (cog modified locally)
RunID: 20101121-2037-7uk5fj33
Progress:
Find: http://localhost:1985
Find: keepalive(120), reconnect - http://localhost:1985
Progress: Selecting site:64 Submitted:28 Active:3 Checking status:1 Finished successfully:4
Progress: Selecting site:48 Submitting:3 Submitted:25 Active:4 Finished successfully:20
Progress: Selecting site:36 Submitted:28 Active:3 Checking status:1 Finished successfully:32
Progress: Selecting site:23 Submitted:24 Active:4 Stage out:3 Finished successfully:46
Progress: Selecting site:6 Submitted:28 Active:3 Checking status:1 Finished successfully:62
Progress: Submitted:17 Active:3 Checking status:1 Finished successfully:79
Progress: Submitted:3 Active:1 Stage out:3 Finished successfully:93
Final status: Finished successfully:100
login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
Swift svn swift-r3707 (swift modified locally) cog-r2932 (cog modified locally)
RunID: 20101121-2039-1yfngygc
Progress:
Find: http://localhost:1985
Find: keepalive(120), reconnect - http://localhost:1985
Progress: Selecting site:68 Submitting:32
Progress: Selecting site:68 Submitting:32
Progress: Selecting site:68 Submitting:32
Progress: Selecting site:68 Submitting:32
Command(1, CHANNELCONFIG): handling reply timeout; sendReqTime=101121-203902.376, sendTime=101121-203902.377, now=101121-204102.399
Command(1, CHANNELCONFIG)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
Progress: Selecting site:68 Submitting:31 Failed but can retry:1
login1$
-----
- Mike
----- Original Message -----
> Right. I would hold off on the service timeout. My tests show that it
> has no impact and, in theory, that it both shouldn't have an impact
> and should not be removed.
>
> Mihael
>
> On Sun, 2010-11-21 at 20:45 -0600, Michael Wilde wrote:
> > I was testing with the two mods below in place (long values in both
> > worker timeout and service timeout).
> >
> > - Mike
> >
> > login1$ pwd
> > /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/provider-coaster
> > login1$
> >
> > login1$ svn diff
> > Index: src/org/globus/cog/abstraction/coaster/service/CoasterService.java
> > ===================================================================
> > --- src/org/globus/cog/abstraction/coaster/service/CoasterService.java (revision 2932)
> > +++ src/org/globus/cog/abstraction/coaster/service/CoasterService.java (working copy)
> > @@ -41,7 +41,7 @@
> > public static final Logger logger = Logger
> > .getLogger(CoasterService.class);
> >
> > - public static final int IDLE_TIMEOUT = 120 * 1000;
> > + public static final int IDLE_TIMEOUT = 120 * 1000 /* extend it: */ * 30 * 240;
> >
> > public static final int CONNECT_TIMEOUT = 2 * 60 * 1000;
> >
> > Index: resources/worker.pl
> > ===================================================================
> > --- resources/worker.pl (revision 2932)
> > +++ resources/worker.pl (working copy)
> > @@ -123,7 +123,7 @@
> > my $URISTR=$ARGV[0];
> > my $BLOCKID=$ARGV[1];
> > my $LOGDIR=$ARGV[2];
> > -my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60) : $ARGV[3];
> > +my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60 * 60 * 24) : $ARGV[3];
> >
> >
> > # REQUESTS holds a map of incoming requests
> > login1$
> >
> >
> > ----- Original Message -----
> > > Ok. I will remove the idle timeouts from the worker. I do not
> > > expect any negative consequences there given the reasoning I
> > > outlined before.
> > >
> > > Mihael
> > >
> > > On Sun, 2010-11-21 at 19:37 -0600, Michael Wilde wrote:
> > > > OK, re bug 2: I didn't connect the symptoms of this issue with
> > > > your earlier comments on timeouts, and just verified that you
> > > > are correct: with the same extended timeouts I was using to try
> > > > to keep a persistent coaster service up for an extended time,
> > > > the failing case for bug 2 works.
> > > >
> > > > I'll try to reproduce bug 1 now, then 3.
> > > >
> > > > - Mike
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote:
> > > > > > Mihael,
> > > > > >
> > > > > > If you're in fixin' mode,
> > > > >
> > > > > I've been in fixin' mode for the past two months :)
> > > > >
> > > > > > I'll spend some time now trying to reproduce the 3 coaster
> > > > > > problems that are high on my "needed for users" list:
> > > > > >
> > > > > > 1. Swift hangs/fails talking to the persistent server if it
> > > > > > sits idle for a few minutes, even with large timeout values
> > > > > > (which were possibly not set correctly or fully).
> > > > > >
> > > > > > 2. With normal coaster mode, if workers start timing out for
> > > > > > lack of work, the Swift run dies.
> > > > >
> > > > > That one is addressed by removing the worker timeout. As I
> > > > > mentioned in a previous email, that timeout is an artifact of
> > > > > an older worker management scheme.
> > > > >
> > > > > >
> > > > > > 3. Errors in provider staging at high volume.
> > > > > >
> > > > > > If you already have test cases for these issues, let me
> > > > > > know, and I'll focus on the missing ones. But I'm assuming
> > > > > > for now you need all three.
> > > > >
> > > > > I have test cases for 1 and 3. I couldn't reproduce the
> > > > > problems so far.
> > > > >
> > > > > Mihael
> > > >
> >
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory