[Swift-devel] coasters about half the jobs
Michael Wilde
wilde at mcs.anl.gov
Sat Feb 19 08:42:56 CST 2011
r3053 nicely fixes the problem with coaster blocks getting cancelled prematurely.
But the scheduling behavior still shows a similar problem, in that only about half the cores are utilized.
I ran the same setup: foreach.max.threads of 50, to run about 2500 jobs at once; a throttle of 25.0 to throttle coasters similarly; 100 slots; a slot walltime of 27 hours (100K secs); and an app maxwalltime of 1 hour in tc.data.
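For reference, that setup corresponds roughly to settings like the following. This is only a sketch; I'm writing the property and profile key names from memory, so double-check them against your actual config files:

```
# swift.properties (sketch)
foreach.max.threads=50

# sites.xml, coaster pool (sketch; key names assumed)
<profile namespace="karajan" key="jobThrottle">25.00</profile>
<profile namespace="globus" key="slots">100</profile>
<profile namespace="globus" key="maxtime">100000</profile>

# tc.data app entry carries maxwalltime=01:00:00
```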
The logs are on CI net at /home/wilde/mp/mp04:
ftdock-20110218-2307-xfdlhkd5.{log,stdout}
Pretty much the same execution pattern occurred:
> Stage in completes rapidly and jobs are started:
Progress: uninitialized:3
Progress: Selecting site:2499 Initializing site shared directory:1
Progress: Selecting site:1100 Initializing site shared directory:1 Stage in:1399
Progress: Stage in:2360 Submitting:140
Progress: Stage in:1630 Submitting:869 Submitted:1
Progress: Stage in:1630 Submitting:820 Submitted:50
Progress: Stage in:1625 Submitting:442 Submitted:433
Progress: Stage in:1625 Submitting:79 Submitted:796
Progress: Stage in:1368 Submitting:12 Submitted:1120
Progress: Stage in:1037 Submitting:58 Submitted:1357 Active:48
Progress: Stage in:302 Submitting:269 Submitted:1812 Active:117
Progress: Submitted:2331 Active:169
Progress: Submitted:2259 Active:241
Progress: Submitted:2211 Active:289
> This time, we get all coaster slots filled pretty quickly:
Progress: Submitted:219 Active:2281
Progress: Submitted:147 Active:2353
Progress: Submitted:100 Active:2400
> Then jobs start finishing:
Progress: Submitted:100 Active:2399 Checking status:1 Finished successfully:2
Progress: Submitted:100 Active:2399 Finished successfully:7
Progress: Submitted:100 Active:2399 Checking status:1 Finished successfully:14
> Workers stay filled until about 800 jobs finish. Then worker utilization starts dropping off monotonically:
Progress: Submitted:95 Active:2399 Checking status:1 Finished successfully:747
Progress: Submitted:90 Active:2398 Finished successfully:760
Progress: Submitted:81 Active:2398 Checking status:1 Stage out:2 Finished successfully:773
Progress: Submitted:71 Active:2399 Checking status:1 Finished successfully:797
Progress: Submitted:68 Active:2398 Checking status:1 Stage out:1 Finished successfully:809
Progress: Submitted:63 Active:2397 Finished successfully:827
Progress: Submitted:64 Active:2392 Checking status:1 Finished successfully:852
Progress: Stage in:1 Submitted:64 Active:2385 Stage out:1 Finished successfully:869
Progress: Submitting:2 Submitted:59 Active:2379 Finished successfully:890
Progress: Submitted:62 Active:2372 Finished successfully:909
> The dropoff continues until I hit ^C on the run:
Progress: Stage in:1 Submitted:1174 Active:1024 Finished successfully:3591
Progress: Submitted:1174 Active:1024 Checking status:1 Finished successfully:3591
Progress: Submitted:1175 Active:1023 Checking status:1 Finished successfully:3593
Progress: Submitted:1175 Active:1023 Stage out:1 Finished successfully:3596
Progress: Submitted:1177 Active:1021 Checking status:1 Finished successfully:3601
Progress: Submitted:1178 Active:1020 Checking status:1 Finished successfully:3604
Shutting down worker
Shutting down worker
> Just before I stopped the run, I checked the number of running worker blocks in PBS a few times, and saw this:
login1$ qstat -u wilde | grep ' R ' | wc -l
99
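To break the count down by state instead of grepping for ' R ', something like this works. A sketch only: the state column position ($10) is assumed from the standard `qstat -u USER` output layout, so verify it against your PBS version:

```shell
# Count PBS jobs by state, reading `qstat -u USER` output on stdin.
# Assumes the state letter is field 10 of each job line (standard -u layout);
# header lines are skipped by requiring a single uppercase letter there.
count_states() {
  awk 'NF >= 11 && $10 ~ /^[A-Z]$/ { c[$10]++ }
       END { for (s in c) printf "%s %d\n", s, c[s] }'
}

# usage: qstat -u wilde | count_states
```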
I caught at least one job in a "C" state. It looks like 1 worker of 100 died, for separate reasons we can explore later. Or could that one worker's termination have triggered the worker-underutilization anomaly?
With a walltime of 100,000 secs (about 27 hours) I would have expected the workers to stay up, barring fatal node errors. Those are a possibility if that one worker died from OOM errors. I wonder if workers could report RAM-pressure stats back to the service? For now just as logging info; later as a scheduling criterion.
I will try now to run with multi-job coaster blocks. If it works, I'll try with one big block and see how the scheduler handles that config.
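For the multi-job-block attempt, the relevant knobs should be the coaster block-shape profiles in sites.xml. A sketch, assuming the usual key names (I'm writing these from memory, so treat them as assumptions to verify):

```
<!-- sites.xml, coaster pool: multi-node blocks instead of 1-node blocks -->
<profile namespace="globus" key="slots">10</profile>
<profile namespace="globus" key="maxNodes">10</profile>
<profile namespace="globus" key="nodeGranularity">10</profile>
<profile namespace="globus" key="workersPerNode">24</profile>
```

For the one-big-block test, presumably slots=1 with maxNodes and nodeGranularity both set to 100.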
- Mike
----- Original Message -----
> And sorry about that.
>
> r3053 should fix that.
>
> On Fri, 2011-02-18 at 20:01 -0800, Mihael Hategan wrote:
> > Thanks.
> >
> > On Fri, 2011-02-18 at 21:58 -0600, Michael Wilde wrote:
> > > It fails for 10- and 1-job runs as well.
> > >
> > > - Mike
> > >
> > >
> > > ----- Original Message -----
> > > > Just tried this on Beagle with a similar workload to the one that
> > > > shows the original problem. I got:
> > > >
> > > > Progress: Stage in:2486 Submitting:14
> > > > Progress: Stage in:1712 Submitting:787 Submitted:1
> > > > queuedsize > 0 but no job dequeued. Queued: {}
> > > > java.lang.Throwable
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:253)
> > > >
> > > > Logs are in:
> > > >
> > > > login1$ cat out.pdb.all.00
> > > > Swift svn swift-r4061 (swift modified locally) cog-r3052 (cog modified locally)
> > > >
> > > > Output on stdout/err is below.
> > > >
> > > > Thanks!
> > > >
> > > > Mike
> > > >
> > > > RunID: 20110218-2137-v87vupcc
> > > > Progress:
> > > > SwiftScript trace: 10gs-1
> > > > SwiftScript trace: 1a1u-1
> > > > SwiftScript trace: 1m3g-1
> > > > SwiftScript trace: 1a1x-1
> > > > SwiftScript trace: 1a1m-1
> > > > SwiftScript trace: 1a12-1
> > > > SwiftScript trace: 1m62-1
> > > > SwiftScript trace: 1a22-1
> > > > SwiftScript trace: 121p-1
> > > > SwiftScript trace: 1a4p-1
> > > > SwiftScript trace: 1m6b-1
> > > > SwiftScript trace: 1m7b-1
> > > > SwiftScript trace: 1m9i-1
> > > > SwiftScript trace: 1mi1-1
> > > > SwiftScript trace: 1m6b-2
> > > > SwiftScript trace: 1a22-2
> > > > SwiftScript trace: 1mfg-1
> > > > SwiftScript trace: 1m9j-1
> > > > SwiftScript trace: 1a1w-1
> > > > SwiftScript trace: 1mdi-1
> > > > SwiftScript trace: 1mq1-1
> > > > SwiftScript trace: 1mp1-1
> > > > SwiftScript trace: 1mq0-1
> > > > SwiftScript trace: 1mk3-1
> > > > SwiftScript trace: 1mj4-1
> > > > SwiftScript trace: 1mil-1
> > > > SwiftScript trace: 1mr1-1
> > > > SwiftScript trace: 1nbq-1
> > > > SwiftScript trace: 1mr8-1
> > > > SwiftScript trace: 1mr1-2
> > > > SwiftScript trace: 1n4m-2
> > > > SwiftScript trace: 1n83-1
> > > > SwiftScript trace: 1mm2-1
> > > > SwiftScript trace: 1nd7-1
> > > > SwiftScript trace: 1nm8-1
> > > > SwiftScript trace: 1n4m-3
> > > > SwiftScript trace: 1nfi-2
> > > > SwiftScript trace: 1nou-2
> > > > SwiftScript trace: 1nou-1
> > > > SwiftScript trace: 1nfi-1
> > > > SwiftScript trace: 1o5e-1
> > > > SwiftScript trace: 1o6u-2
> > > > SwiftScript trace: 1nty-1
> > > > SwiftScript trace: 1mx3-1
> > > > SwiftScript trace: 1n3u-2
> > > > SwiftScript trace: 1muz-1
> > > > SwiftScript trace: 1o86-1
> > > > SwiftScript trace: 1n3u-1
> > > > SwiftScript trace: 1oa8-1
> > > > SwiftScript trace: 1oc0-1
> > > > Progress: uninitialized:3
> > > > Progress: Initializing:1311 Selecting site:1189
> > > > Progress: Selecting site:2499 Initializing site shared directory:1
> > > > Progress: Selecting site:2340 Initializing site shared directory:1 Stage in:159
> > > > Progress: Stage in:2486 Submitting:14
> > > > Progress: Stage in:1712 Submitting:787 Submitted:1
> > > > queuedsize > 0 but no job dequeued. Queued: {}
> > > > java.lang.Throwable
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:253)
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:521)
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:109)
> > > > queuedsize > 0 but no job dequeued. Queued: {}
> > > > java.lang.Throwable
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:253)
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:521)
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:109)
> > > >
> > > >
> > > > Logs are on CI net in /home/wilde/mp/mp04:
> > > > cp ftdock-20110218-2137-v87vupcc.log out.pdb.all.00 ~/mp/mp04/
> > > >
> > > > - Mike
> > > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > There was a bug in the block allocation scheme that would cause
> > > > > blocks to be kept, in the long run, at about half of what would
> > > > > normally be necessary. This included shutting down perfectly good
> > > > > blocks that could be used for jobs. The effect was more dramatic
> > > > > when the maximum block size was 1.
> > > > >
> > > > > I committed a fix for this in the stable branch (cog r3052). If
> > > > > you've experienced the above, you could give this a try. It would
> > > > > also be helpful if you gave it a try anyway, just to check if
> > > > > things are going ok.
> > > > >
> > > > > Mihael
> > > > >
> > > > > _______________________________________________
> > > > > Swift-devel mailing list
> > > > > Swift-devel at ci.uchicago.edu
> > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > >
> > > > --
> > > > Michael Wilde
> > > > Computation Institute, University of Chicago
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > >
> > >
> >
> >
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory