[Swift-devel] coasters about half the jobs

Michael Wilde wilde at mcs.anl.gov
Sat Feb 19 10:15:24 CST 2011


Mihael, I need to correct one point I made below: 

> But the scheduling behavior still shows a similar problem, in that
> only about half the cores are utilized.

As I was pasting the output, I realized that the workers *were* getting filled to 100%, but later the utilization dropped off and did not seem to recover.

In later experiments I tested with a single large coaster block of compute nodes instead of many small one-node blocks.  This showed some interesting behavior (and, I think, much better utilization).  There I got an oscillating pattern: all 2400 cores utilized, then what looked like sinusoidal dips to about 2100 cores, then back up to 2400, and so on.  (I can't tell without plotting whether it's really a sinusoid.)
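
If it helps, here is a minimal sketch of how I'd pull the Active counts out of the run's stdout to plot them (the parsing assumes Progress lines shaped like the ones pasted below; the sample lines are just illustrative):

```python
import re

def active_counts(lines):
    """Extract the Active job count from Swift 'Progress:' lines."""
    counts = []
    for line in lines:
        m = re.search(r"Active:(\d+)", line)
        if m:
            counts.append(int(m.group(1)))
    return counts

# Sample Progress lines like the ones in this run's stdout
sample = [
    "Progress: Submitted:100 Active:2400",
    "Progress: Submitted:95 Active:2399 Checking status:1 Finished successfully:747",
    "Progress: Submitted:1174 Active:1024 Finished successfully:3591",
]
print(active_counts(sample))  # -> [2400, 2399, 1024]
```

Running the full stdout through this and plotting the counts against line index (or timestamps, if logged) should show whether the dips are actually periodic.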

I'm a bit suspicious of some interaction in this run with the foreach.maxthreads throttle: that throttle is set only 100 jobs higher than the number of workers, and I see curious reporting of the "submitted" count, which does not stay at 100 as I would expect.

Since the large node blocks seem to work now, I'm going to try to get a science-production run going, and we can come back to the scheduling behavior later.  I'll post the log from the big-block run shortly; maybe you can see the pattern and the issue from that.

Thanks,

Mike



----- Original Message -----
> r3053 nicely fixes the problem with coaster blocks getting cancelled
> prematurely.
> But the scheduling behavior still shows a similar problem, in that
> only about half the cores are utilized.
> 
> I ran the same setup: foreach.maxthreads of 50 to run about 2500 jobs
> at once, a throttle of 25.0 to throttle coasters similarly, and 100 slots.
> Slots have a 27-hour walltime (100K secs). App maxwalltime in tc.data is
> 1 hour.
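
As a sanity check on the throttle arithmetic (assuming Swift's jobThrottle admits roughly throttle*100 + 1 concurrent jobs, and 24 cores per slot; both figures are my assumptions, not stated above):

```python
# Rough concurrency math for this run.  The throttle*100 + 1 formula and
# the 24 cores/slot figure are assumptions; double-check both.
job_throttle = 25.0
max_concurrent = int(job_throttle * 100) + 1   # jobs Swift will run at once
slots = 100
cores_per_slot = 24                            # assumed node size
worker_cores = slots * cores_per_slot
print(max_concurrent, worker_cores)  # -> 2501 2400
```

That would put the job throttle about 100 above the worker-core count, consistent with the Submitted:100 plateau in the log below.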
> 
> The logs are on CI net at /home/wilde/mp/mp04:
> ftdock-20110218-2307-xfdlhkd5.{log,stdout}
> 
> Pretty much same execution pattern occurred:
> 
> > Stage in completes rapidly and jobs are started:
> 
> Progress: uninitialized:3
> Progress: Selecting site:2499 Initializing site shared directory:1
> Progress: Selecting site:1100 Initializing site shared directory:1
> Stage in:1399
> Progress: Stage in:2360 Submitting:140
> Progress: Stage in:1630 Submitting:869 Submitted:1
> Progress: Stage in:1630 Submitting:820 Submitted:50
> Progress: Stage in:1625 Submitting:442 Submitted:433
> Progress: Stage in:1625 Submitting:79 Submitted:796
> Progress: Stage in:1368 Submitting:12 Submitted:1120
> Progress: Stage in:1037 Submitting:58 Submitted:1357 Active:48
> Progress: Stage in:302 Submitting:269 Submitted:1812 Active:117
> Progress: Submitted:2331 Active:169
> Progress: Submitted:2259 Active:241
> Progress: Submitted:2211 Active:289
> 
> > This time, we get all coaster slots filled pretty quickly:
> 
> Progress: Submitted:219 Active:2281
> Progress: Submitted:147 Active:2353
> Progress: Submitted:100 Active:2400
> 
> > Then jobs start finishing:
> 
> Progress: Submitted:100 Active:2399 Checking status:1 Finished
> successfully:2
> Progress: Submitted:100 Active:2399 Finished successfully:7
> Progress: Submitted:100 Active:2399 Checking status:1 Finished
> successfully:14
> 
> > Workers stay filled until about 800 jobs finish. Then the worker
> > utilization level starts dropping off, monotonically:
> 
> Progress: Submitted:95 Active:2399 Checking status:1 Finished
> successfully:747
> Progress: Submitted:90 Active:2398 Finished successfully:760
> Progress: Submitted:81 Active:2398 Checking status:1 Stage out:2
> Finished successfully:773
> Progress: Submitted:71 Active:2399 Checking status:1 Finished
> successfully:797
> Progress: Submitted:68 Active:2398 Checking status:1 Stage out:1
> Finished successfully:809
> Progress: Submitted:63 Active:2397 Finished successfully:827
> Progress: Submitted:64 Active:2392 Checking status:1 Finished
> successfully:852
> Progress: Stage in:1 Submitted:64 Active:2385 Stage out:1 Finished
> successfully:869
> Progress: Submitting:2 Submitted:59 Active:2379 Finished
> successfully:890
> Progress: Submitted:62 Active:2372 Finished successfully:909
> 
> > The dropoff continues till I hit ^C on the run:
> 
> Progress: Stage in:1 Submitted:1174 Active:1024 Finished
> successfully:3591
> Progress: Submitted:1174 Active:1024 Checking status:1 Finished
> successfully:3591
> Progress: Submitted:1175 Active:1023 Checking status:1 Finished
> successfully:3593
> Progress: Submitted:1175 Active:1023 Stage out:1 Finished
> successfully:3596
> Progress: Submitted:1177 Active:1021 Checking status:1 Finished
> successfully:3601
> Progress: Submitted:1178 Active:1020 Checking status:1 Finished
> successfully:3604
> Shutting down worker
> 
> Shutting down worker
> 
> > Just before I stopped the run, I checked a few times on # running
> > worker blocks in PBS, and saw this:
> 
> login1$ qstat -u wilde | grep ' R ' | wc -l
> 99
> 
> I caught at least 1 job in a "C" state. Looks like 1 worker of 100
> died, for separate reasons we can explore later. Or, could that one
> worker's termination have triggered the worker-underutilization anomaly?
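
For the record, here is a small sketch of how I'd tally the per-state counts from the qstat output in one pass (the column layout is assumed: state as the second-to-last field, as in the `qstat -u` output here; adjust for other PBS variants):

```python
from collections import Counter

def state_counts(qstat_text):
    """Tally PBS job states (R, Q, C, E, H) from `qstat -u <user>` output.

    Assumes the state letter is the second-to-last column; the sample
    line layout below is hypothetical, for illustration only.
    """
    counts = Counter()
    for line in qstat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[-2] in {"R", "Q", "C", "E", "H"}:
            counts[fields[-2]] += 1
    return counts

# Hypothetical qstat-style lines
sample = """\
123.svc wilde batch blk-0 100 1 -- 27:00:00 R 01:02:03
124.svc wilde batch blk-1 100 1 -- 27:00:00 C 27:00:00"""
print(dict(state_counts(sample)))  # -> {'R': 1, 'C': 1}
```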
> 
> With a duration of 100,000 secs / 27 hours I would have expected the
> workers to stay up (in the absence of fatal node errors, I guess,
> which may be a possibility if that one worker died from OOM errors?
> I wonder if workers could report RAM-pressure stats back to the service:
> for now just as logging info, later as a scheduling criterion?)
> 
> I will try now to run with multi-job coaster blocks. If it works, I'll
> try with one big block and see how the scheduler handles that config.
> 
> - Mike
> 
> 
> ----- Original Message -----
> > And sorry about that.
> >
> > r3053 should fix that.
> >
> > On Fri, 2011-02-18 at 20:01 -0800, Mihael Hategan wrote:
> > > Thanks.
> > >
> > > On Fri, 2011-02-18 at 21:58 -0600, Michael Wilde wrote:
> > > > It fails for 10- and 1-job runs as well.
> > > >
> > > > - Mike
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > Just tried this on Beagle with a workload similar to the one that
> > > > > shows the original problem. I got:
> > > > >
> > > > > Progress: Stage in:2486 Submitting:14
> > > > > Progress: Stage in:1712 Submitting:787 Submitted:1
> > > > > queuedsize > 0 but no job dequeued. Queued: {}
> > > > > java.lang.Throwable
> > > > > at
> > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:253)
> > > > >
> > > > > Logs are in:
> > > > >
> > > > > login1$ cat out.pdb.all.00
> > > > > Swift svn swift-r4061 (swift modified locally) cog-r3052 (cog
> > > > > modified
> > > > > locally)
> > > > >
> > > > > Output on stdout/err is below.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Mike
> > > > >
> > > > > RunID: 20110218-2137-v87vupcc
> > > > > Progress:
> > > > > SwiftScript trace: 10gs-1
> > > > > SwiftScript trace: 1a1u-1
> > > > > SwiftScript trace: 1m3g-1
> > > > > SwiftScript trace: 1a1x-1
> > > > > SwiftScript trace: 1a1m-1
> > > > > SwiftScript trace: 1a12-1
> > > > > SwiftScript trace: 1m62-1
> > > > > SwiftScript trace: 1a22-1
> > > > > SwiftScript trace: 121p-1
> > > > > SwiftScript trace: 1a4p-1
> > > > > SwiftScript trace: 1m6b-1
> > > > > SwiftScript trace: 1m7b-1
> > > > > SwiftScript trace: 1m9i-1
> > > > > SwiftScript trace: 1mi1-1
> > > > > SwiftScript trace: 1m6b-2
> > > > > SwiftScript trace: 1a22-2
> > > > > SwiftScript trace: 1mfg-1
> > > > > SwiftScript trace: 1m9j-1
> > > > > SwiftScript trace: 1a1w-1
> > > > > SwiftScript trace: 1mdi-1
> > > > > SwiftScript trace: 1mq1-1
> > > > > SwiftScript trace: 1mp1-1
> > > > > SwiftScript trace: 1mq0-1
> > > > > SwiftScript trace: 1mk3-1
> > > > > SwiftScript trace: 1mj4-1
> > > > > SwiftScript trace: 1mil-1
> > > > > SwiftScript trace: 1mr1-1
> > > > > SwiftScript trace: 1nbq-1
> > > > > SwiftScript trace: 1mr8-1
> > > > > SwiftScript trace: 1mr1-2
> > > > > SwiftScript trace: 1n4m-2
> > > > > SwiftScript trace: 1n83-1
> > > > > SwiftScript trace: 1mm2-1
> > > > > SwiftScript trace: 1nd7-1
> > > > > SwiftScript trace: 1nm8-1
> > > > > SwiftScript trace: 1n4m-3
> > > > > SwiftScript trace: 1nfi-2
> > > > > SwiftScript trace: 1nou-2
> > > > > SwiftScript trace: 1nou-1
> > > > > SwiftScript trace: 1nfi-1
> > > > > SwiftScript trace: 1o5e-1
> > > > > SwiftScript trace: 1o6u-2
> > > > > SwiftScript trace: 1nty-1
> > > > > SwiftScript trace: 1mx3-1
> > > > > SwiftScript trace: 1n3u-2
> > > > > SwiftScript trace: 1muz-1
> > > > > SwiftScript trace: 1o86-1
> > > > > SwiftScript trace: 1n3u-1
> > > > > SwiftScript trace: 1oa8-1
> > > > > SwiftScript trace: 1oc0-1
> > > > > Progress: uninitialized:3
> > > > > Progress: Initializing:1311 Selecting site:1189
> > > > > Progress: Selecting site:2499 Initializing site shared
> > > > > directory:1
> > > > > Progress: Selecting site:2340 Initializing site shared
> > > > > directory:1
> > > > > Stage in:159
> > > > > Progress: Stage in:2486 Submitting:14
> > > > > Progress: Stage in:1712 Submitting:787 Submitted:1
> > > > > queuedsize > 0 but no job dequeued. Queued: {}
> > > > > java.lang.Throwable
> > > > > at
> > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:253)
> > > > > at
> > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:521)
> > > > > at
> > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:109)
> > > > > queuedsize > 0 but no job dequeued. Queued: {}
> > > > > java.lang.Throwable
> > > > > at
> > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:253)
> > > > > at
> > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:521)
> > > > > at
> > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:109)
> > > > >
> > > > >
> > > > > Logs are on CT net in /home/wilde/mp/mp04:
> > > > > cp ftdock-20110218-2137-v87vupcc.log out.pdb.all.00 ~/mp/mp04/
> > > > >
> > > > > - Mike
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > > There was a bug in the block allocation scheme that would
> > > > > > cause
> > > > > > blocks
> > > > > > to be kept, in the long run, at about half of what would
> > > > > > normally be
> > > > > > necessary. This included shutting down perfectly good blocks
> > > > > > that
> > > > > > could
> > > > > > be used for jobs. The effect was more dramatic when the
> > > > > > maximum
> > > > > > block
> > > > > > size was 1.
> > > > > >
> > > > > > I committed a fix for this in the stable branch (cog r3052).
> > > > > > If
> > > > > > you've
> > > > > > experienced the above, you could give this a try. It would
> > > > > > also be
> > > > > > helpful if you gave it a try anyway, just to check if things
> > > > > > are
> > > > > > going
> > > > > > ok.
> > > > > >
> > > > > > Mihael
> > > > > >
> > > > > > _______________________________________________
> > > > > > Swift-devel mailing list
> > > > > > Swift-devel at ci.uchicago.edu
> > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > > >
> > > > > --
> > > > > Michael Wilde
> > > > > Computation Institute, University of Chicago
> > > > > Mathematics and Computer Science Division
> > > > > Argonne National Laboratory
> > > > >
> > > >
> > >
> > >
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



