[Swift-devel] Re: the persistence of the persistent coaster service.

Michael Wilde wilde at mcs.anl.gov
Wed Nov 17 23:25:19 CST 2010


Allan, I've had similar symptoms, but I think I'm seeing different error messages.

When I start a persistent service, I can run repeated Swift scripts against it, but *only* if I do them in fairly quick succession.  If I let the service sit idle for more than about 5 minutes, it becomes unusable.

I need to carefully capture a test case, and verify it against an unmodified trunk, so that Mihael can reproduce and fix the problem.  I think that's the key: if you can give Mihael a way to easily reproduce the problem at will, then he'll likely be able to fix it quickly.
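
As a starting point for such a test case, here is a minimal Python sketch of a driver that runs the same command several times with a configurable idle gap between runs. The function name `run_with_idle_gap` and the file names in the usage comment are hypothetical placeholders, not part of Swift; the command line shown assumes the usual `-sites.file` and `-tc.file` options and should be adjusted to the local setup.

```python
import subprocess
import sys
import time


def run_with_idle_gap(command, idle_seconds, runs=2):
    """Run `command` `runs` times, sleeping `idle_seconds` between runs.

    Returns a list of (exit_code, elapsed_seconds) tuples, one per run,
    so a failure after the idle period shows up as a non-zero exit code
    or an unusually long elapsed time on the later runs.
    """
    results = []
    for i in range(runs):
        if i > 0:
            # Leave the persistent service idle before the next run.
            time.sleep(idle_seconds)
        start = time.time()
        proc = subprocess.run(command)
        elapsed = time.time() - start
        print(f"run {i + 1}: exit={proc.returncode} elapsed={elapsed:.1f}s",
              file=sys.stderr)
        results.append((proc.returncode, elapsed))
    return results


# Example use (assumed file names; 360 s is just over the ~5 minute
# idle period after which the service seemed to become unusable):
#
#   run_with_idle_gap(
#       ["swift", "-sites.file", "sites.xml", "-tc.file", "tc.data",
#        "sleep.swift"],
#       idle_seconds=360)
```

Running it once with a short gap and once with a gap over 5 minutes, against the same service, should show whether the idle time alone is enough to trigger the failure.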

I also see a possibly related problem: when I run coasters with a large number of slots (say 64) and my workload cannot keep the workers busy because of staging delays, the workers start timing out (i.e., I get the message "Job Cancelled"). This causes an error somewhere on the client side, and Swift quickly dies with a fatal error. I need to try to reproduce this as well and/or capture logs from it.

I hope to get to this next week after SC.

- Mike




----- Original Message -----
> Upon the client's connection, this gets registered in the service log:
> 
> ...
> ...
> Plan time: 1
> Plan time: 1
> GSSSChannel-null(0)[1175215772: {}]: Disabling heartbeats (config is
> null)
> (1) Scheduling GSSSChannel-null(12)[1175215772: {}] for addition
> nullChannel started
> Channel id: u-20ccd0f-12c5bc25c45--8000-u-28c73091-12c5b774ab1--7ff5
> MetaChannel: 682820082[1175215772: {}] -> null: Disabling heartbeats
> (disabled in config)
> MetaChannel: 682820082[1175215772: {}] -> null.bind ->
> GSSSChannel-null(12)[1175215772: {}]
> Plan time: 1
> Congestion queue size: 0
> runTime: 0, sleepTime: 10049
> Plan time: 1
> ...
> ...
> 
> 2010/11/17 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> > Bumping the thread. In an attempt to isolate the bug, I made this
> > workflow:
> >
> > app (external o) sleep(int time) {
> >  sleep time;
> > }
> >
> >
> > /* Main program */
> > external rups[];
> >
> > int t = 300;
> > int a[];
> >
> > iterate ix {
> >  a[ix] = ix;
> > } until (ix == 1300);
> >
> > foreach ai,i in a {
> >  rups[i] = sleep(t);
> > }
> >
> >
> > <config>
> >  <pool handle="localhost">
> >    <execution provider="coaster-persistent"
> > url="https://communicado.ci.uchicago.edu:61999"
> >        jobmanager="local:local" />
> >
> >    <profile namespace="globus" key="workerManager">passive</profile>
> >
> >    <gridftp url="local://localhost"/>
> >    <workdirectory>/gpfs/pads/swift/aespinosa/swift-runs</workdirectory>
> >  </pool>
> >
> >
> > </config>
> >
> > localhost sleep /bin/sleep INSTALLED INTEL32::LINUX null
> >
> > and still get the same type of error message:
> > RunID: 20101117-1527-ui6i2lra
> > Progress:
> > Find: https://communicado.ci.uchicago.edu:61999
> > Find: keepalive(120), reconnect -
> > https://communicado.ci.uchicago.edu:61999
> > Progress: Selecting site:1 Submitting:294
> > Progress: Selecting site:3 Submitting:367
> > Progress: Selecting site:3 Submitting:367
> > Progress: Selecting site:3 Submitting:367
> > Progress: Selecting site:3 Submitting:367
> > Command(1, CHANNELCONFIG): handling reply timeout;
> > sendReqTime=101117-152717.209, sendTime=101117-152717.211,
> > now=101117-152917.232
> > Progress: Selecting site:3 Submitting:366 Submitted:1
> > Command(1, CHANNELCONFIG)fault was: Reply timeout
> > org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> >        at
> >        org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
> >        at
> >        org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
> >        at java.util.TimerThread.mainLoop(Timer.java:512)
> >        at java.util.TimerThread.run(Timer.java:462)
> > Progress: Selecting site:3 Submitting:366 Failed but can retry:1
> > Progress: Selecting site:3 Submitting:366 Failed but can retry:1
> >
> >
> > 2010/10/21 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> >> Hi,
> >>
> >> When I reuse the coaster service in the next Swift session, I
> >> get reply timeouts on the CHANNELCONFIG command:
> >>
> >>
> >> swift-r3685 cog-r2913
> >>
> >> RunID: extract
> >> Progress:
> >> Progress: uninitialized:2 Finished in previous run:2
> >> Progress: uninitialized:2 Finished in previous run:2
> >> Progress: Stage in:99 Submitting:1 Finished in previous run:102
> >> Find: https://communicado.ci.uchicago.edu:61999
> >> Find: keepalive(120), reconnect -
> >> https://communicado.ci.uchicago.edu:61999
> >> Progress: Stage in:92 Submitting:8 Finished in previous run:102
> >> Passive queue processor initialized. Callback URI is
> >> http://128.135.125.17:60999
> >> Progress: Stage in:71 Submitting:2 Submitted:27 Finished in
> >> previous run:102
> >> Progress: Stage in:29 Submitting:1 Submitted:70 Finished in
> >> previous run:102
> >>
> >> **Abort** (Ctrl-C)
> >> ** rerun / resume workflow **
> >> swift-r3685 cog-r2913
> >>
> >> RunID: extract
> >> Progress:
> >> Progress: uninitialized:3 Finished in previous run:2
> >> Progress: Stage in:99 Submitting:1 Finished in previous run:102
> >> Find: https://communicado.ci.uchicago.edu:61999
> >> Find: keepalive(120), reconnect -
> >> https://communicado.ci.uchicago.edu:61999
> >> Progress: Stage in:92 Submitting:8 Finished in previous run:102
> >> Progress: Stage in:92 Submitting:8 Finished in previous run:102
> >> Progress: Stage in:92 Submitting:8 Finished in previous run:102
> >> Progress: Stage in:92 Submitting:8 Finished in previous run:102
> >> Command(1, CHANNELCONFIG): handling reply timeout;
> >> sendReqTime=101021-174124.460, sendTime=101021-174124.471,
> >> now=101021-174324.492
> >> Command(1, CHANNELCONFIG)fault was: Reply timeout
> >> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> >>        at
> >>        org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:280)
> >>        at
> >>        org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:285)
> >>        at java.util.TimerThread.mainLoop(Timer.java:512)
> >>        at java.util.TimerThread.run(Timer.java:462)
> >> Progress: Stage in:92 Submitting:7 Submitted:1 Finished in previous
> >> run:102
> >>
> >> My sites.xml sets the persistent service to work in passive mode.
> >>
> >>
> >> thanks,
> >> -Allan
> >>
> >> --
> >> Allan M. Espinosa <http://amespinosa.wordpress.com>
> >> PhD student, Computer Science
> >> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
> >>
> >
> >
> >
> > --
> > Allan M. Espinosa <http://amespinosa.wordpress.com>
> > PhD student, Computer Science
> > University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
> >
> 
> 
> 
> --
> Allan M. Espinosa <http://amespinosa.wordpress.com>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



