[Swift-devel] Persistent coaster service fails after several runs
Michael Wilde
wilde at mcs.anl.gov
Sun Nov 21 21:00:04 CST 2010
subject was: Re: [Swift-devel] misassignment of jobs
Re the service-side timeout, OK, will do.
I've just re-created bug 1, but it's a little different than I thought.
Swift runs against the persistent coaster server lock up (i.e. fail to progress) and then produce errors, not after a delay but seemingly at random. That's likely why I was misled into thinking it was delay-related.
I started a coaster server on localhost with one worker.pl.
I then ran catsn.swift against it with various values of n (the number of cat jobs), including 1, 10, and 100.
The first several (5-10) Swift runs worked fine. Then I let it sit idle for 5 minutes and tried again. That too worked fine. But then, after a few more runs, things hung.
Here are all the logs and details if you want to look into this particular run.
I'm working in /home/wilde/swift/lab, on PADS login1.
The latest .log in this listing is the failing case; the others worked (against the same persistent server):
login1$ ls -lt *.log | head -20
-rw-r--r-- 1 wilde ci-users 95478 Nov 21 20:41 catsn-20101121-2039-1yfngygc.log
-rw-r--r-- 1 wilde ci-users 36085 Nov 21 20:39 swift.log
-rw-r--r-- 1 wilde ci-users 272734 Nov 21 20:37 catsn-20101121-2037-7uk5fj33.log
-rw-r--r-- 1 wilde ci-users 272644 Nov 21 20:37 catsn-20101121-2037-j8xq9aie.log
-rw-r--r-- 1 wilde ci-users 272468 Nov 21 20:36 catsn-20101121-2036-4y0tnimd.log
-rw-r--r-- 1 wilde ci-users 31317 Nov 21 20:36 catsn-20101121-2036-opcvomk4.log
-rw-r--r-- 1 wilde ci-users 7183 Nov 21 20:36 catsn-20101121-2036-u59brtm4.log
-rw-r--r-- 1 wilde ci-users 7183 Nov 21 20:35 catsn-20101121-2035-360kh03b.log
-rw-r--r-- 1 wilde ci-users 7351 Nov 21 20:35 catsn-20101121-2035-8lttnn88.log
-rw-r--r-- 1 wilde ci-users 7183 Nov 21 20:30 catsn-20101121-2030-ddmo6gt3.log
-rw-r--r-- 1 wilde ci-users 7267 Nov 21 20:29 catsn-20101121-2029-sq8y6cnb.log
-rw-r--r-- 1 wilde ci-users 7179 Nov 21 20:29 catsn-20101121-2029-3su2x8v9.log
-rw-r--r-- 1 wilde ci-users 7183 Nov 21 20:29 catsn-20101121-2029-z0g50i50.log
-rw-r--r-- 1 wilde ci-users 7267 Nov 21 20:29 catsn-20101121-2029-5x6pbkde.log
The worker and service logs are in: /tmp/wilde/Swift/{server,worker}
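If it helps to line them up with the failing run, the newest files under those two directories should correspond to it; something like the following (plain shell, and the file layout under those directories is my assumption) pulls up the most recent ones:
login1$ ls -lt /tmp/wilde/Swift/server | head -5
login1$ ls -lt /tmp/wilde/Swift/worker | head -5
login1$ tail -n 50 /tmp/wilde/Swift/server/$(ls -t /tmp/wilde/Swift/server | head -1)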
The swift being used is:
/scratch/local/wilde/swift/src/trunk.gomods/cog/modules/swift/dist/swift-svn/bin/swift
The test runs were all of this form, with various n as above:
login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
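(If it's useful, the reproduction pattern described above can be scripted roughly as follows; the script itself is hypothetical, and the iteration counts, the 5-minute sleep, and n=100 are just the values from this session:)
# burst of runs, an idle gap, then more runs until one of them hangs
for i in 1 2 3 4 5; do
  swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
done
sleep 300   # let the service/worker sit idle for ~5 minutes
for i in 1 2 3 4 5; do
  swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
done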
I started the persistent coaster service with the somewhat ugly script:
/home/wilde/swift/lab/pecos/start-mcs
(which runs a dummy job to force the server into passive mode, to handle the general case of workers joining and leaving the service).
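The gist of it is roughly the following. This is only a sketch, not the actual script: the service-startup step itself is omitted, the worker.pl argument order (service URI, block ID, log dir, optional idle timeout in seconds) is taken from the worker.pl snippet quoted further down, and the block ID and timeout values here are arbitrary.
# start the persistent coaster service on port 1985 -- omitted here;
# this is whatever start-mcs actually does to launch it

# start one worker pointed at the service, with a long idle timeout
# (4th argument, in seconds) so it doesn't exit between Swift runs;
# "0001" is an arbitrary block ID and the log dir is the one noted above
perl worker.pl http://localhost:1985 0001 /tmp/wilde/Swift/worker 86400 &

# run one dummy job through the service to force it into passive mode,
# so workers can join and leave independently of any single Swift run
swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=1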
I'll clean this up for reproducibility if you can't spot the issue from these logs.
Lastly, the last few runs, including the failing one, produced the following on stdout/stderr:
login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
Swift svn swift-r3707 (swift modified locally) cog-r2932 (cog modified locally)
RunID: 20101121-2037-j8xq9aie
Progress:
Find: http://localhost:1985
Find: keepalive(120), reconnect - http://localhost:1985
Progress: Selecting site:64 Submitting:3 Submitted:25 Active:4 Finished successfully:4
Progress: Selecting site:52 Submitted:28 Active:3 Checking status:1 Finished successfully:16
Progress: Selecting site:36 Submitting:3 Submitted:25 Active:4 Finished successfully:32
Progress: Selecting site:23 Submitted:28 Active:3 Checking status:1 Finished successfully:45
Progress: Selecting site:7 Submitted:27 Active:3 Checking status:1 Finished successfully:62
Progress: Submitted:14 Active:2 Stage out:3 Finished successfully:81
Progress: Submitted:3 Stage out:3 Finished successfully:94
Final status: Finished successfully:100
login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
Swift svn swift-r3707 (swift modified locally) cog-r2932 (cog modified locally)
RunID: 20101121-2037-7uk5fj33
Progress:
Find: http://localhost:1985
Find: keepalive(120), reconnect - http://localhost:1985
Progress: Selecting site:64 Submitted:28 Active:3 Checking status:1 Finished successfully:4
Progress: Selecting site:48 Submitting:3 Submitted:25 Active:4 Finished successfully:20
Progress: Selecting site:36 Submitted:28 Active:3 Checking status:1 Finished successfully:32
Progress: Selecting site:23 Submitted:24 Active:4 Stage out:3 Finished successfully:46
Progress: Selecting site:6 Submitted:28 Active:3 Checking status:1 Finished successfully:62
Progress: Submitted:17 Active:3 Checking status:1 Finished successfully:79
Progress: Submitted:3 Active:1 Stage out:3 Finished successfully:93
Final status: Finished successfully:100
login1$ swift -config cf -tc.file tc -sites.file pecos04.xml catsn.swift -n=100
Swift svn swift-r3707 (swift modified locally) cog-r2932 (cog modified locally)
RunID: 20101121-2039-1yfngygc
Progress:
Find: http://localhost:1985
Find: keepalive(120), reconnect - http://localhost:1985
Progress: Selecting site:68 Submitting:32
Progress: Selecting site:68 Submitting:32
Progress: Selecting site:68 Submitting:32
Progress: Selecting site:68 Submitting:32
Command(1, CHANNELCONFIG): handling reply timeout; sendReqTime=101121-203902.376, sendTime=101121-203902.377, now=101121-204102.399
Command(1, CHANNELCONFIG)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
Progress: Selecting site:68 Submitting:31 Failed but can retry:1
login1$
-----
- Mike
----- Original Message -----
> Right. I would hold off on the service timeout. My tests show that it
> has no impact and, in theory, that it both shouldn't have an impact
> and should not be removed.
>
> Mihael
>
> On Sun, 2010-11-21 at 20:45 -0600, Michael Wilde wrote:
> > I was testing with the two mods below in place (long values in both
> > worker timeout and service timeout).
> >
> > - Mike
> >
> > login1$ pwd
> > /scratch/local/wilde/swift/src/trunk.gomods/cog/modules/provider-coaster
> > login1$
> >
> > login1$ svn diff
> > Index: src/org/globus/cog/abstraction/coaster/service/CoasterService.java
> > ===================================================================
> > --- src/org/globus/cog/abstraction/coaster/service/CoasterService.java (revision 2932)
> > +++ src/org/globus/cog/abstraction/coaster/service/CoasterService.java (working copy)
> > @@ -41,7 +41,7 @@
> > public static final Logger logger = Logger
> > .getLogger(CoasterService.class);
> >
> > - public static final int IDLE_TIMEOUT = 120 * 1000;
> > + public static final int IDLE_TIMEOUT = 120 * 1000 /* extend it: */ * 30 * 240;
> >
> > public static final int CONNECT_TIMEOUT = 2 * 60 * 1000;
> >
> > Index: resources/worker.pl
> > ===================================================================
> > --- resources/worker.pl (revision 2932)
> > +++ resources/worker.pl (working copy)
> > @@ -123,7 +123,7 @@
> > my $URISTR=$ARGV[0];
> > my $BLOCKID=$ARGV[1];
> > my $LOGDIR=$ARGV[2];
> > -my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60) : $ARGV[3];
> > +my $IDLETIMEOUT = ( $#ARGV <= 2 ) ? (4 * 60 * 60 * 24) : $ARGV[3];
> >
> >
> > # REQUESTS holds a map of incoming requests
> > login1$
> >
> >
> > ----- Original Message -----
> > > Ok. I will remove the idle timeouts from the worker. I do not
> > > expect any negative consequences there given the reasoning I
> > > outlined before.
> > >
> > > Mihael
> > >
> > > On Sun, 2010-11-21 at 19:37 -0600, Michael Wilde wrote:
> > > > OK, re bug 2: I didn't connect the symptoms of this issue with
> > > > your earlier comments on timeouts, and just verified that you
> > > > are correct: with the same extended timeouts I was using to try
> > > > to keep a persistent coaster service up for an extended time,
> > > > the failing case for bug 2 works.
> > > >
> > > > I'll try to reproduce bug 1 now, then 3.
> > > >
> > > > - Mike
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > On Sun, 2010-11-21 at 17:10 -0600, Michael Wilde wrote:
> > > > > > Mihael,
> > > > > >
> > > > > > If you're in fixin' mode,
> > > > >
> > > > > I've been in fixin' mode for the past two months :)
> > > > >
> > > > > > I'll spend some time now trying to reproduce the 3 coaster
> > > > > > problems that are high on my "needed for users" list:
> > > > > >
> > > > > > 1. Swift hangs/fails talking to the persistent server if it
> > > > > > sits idle for a few minutes, even with large timeout values
> > > > > > (which were possibly not set correctly or fully).
> > > > > >
> > > > > > 2. With normal coaster mode, if workers start timing out for
> > > > > > lack of work, the Swift run dies.
> > > > >
> > > > > That one is addressed by removing the worker timeout. As I
> > > > > mentioned in a previous email, that timeout is an artifact of
> > > > > an older worker management scheme.
> > > > >
> > > > > >
> > > > > > 3. Errors in provider staging at high volume.
> > > > > >
> > > > > > If you already have test cases for these issues, let me
> > > > > > know, and I'll focus on the missing ones. But I'm assuming
> > > > > > for now you need all three.
> > > > >
> > > > > I have test cases for 1 and 3. I couldn't reproduce the
> > > > > problems so far.
> > > > >
> > > > > Mihael
> > > >
> >
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory