[Swift-devel] coasters about half the jobs
Michael Wilde
wilde at mcs.anl.gov
Sat Feb 19 08:42:56 CST 2011
r3053 nicely fixes the problem with coaster blocks getting cancelled prematurely.
But the scheduling behavior still shows a similar problem, in that only about half the cores are utilized.
I ran the same setup: foreach.max.threads of 50, to run about 2500 jobs at once; a throttle of 25.0 to throttle coasters similarly; 100 slots; a slot walltime of 27 hours (100K secs); and an app maxwalltime of 1 hour in tc.data.
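For reference, that setup corresponds roughly to settings like the following. This is only a sketch; I'm writing the property and profile key names from memory, so double-check them against your actual config files:

```
# swift.properties (sketch)
foreach.max.threads=50

# sites.xml, coaster pool (sketch; key names assumed)
<profile namespace="karajan" key="jobThrottle">25.00</profile>
<profile namespace="globus" key="slots">100</profile>
<profile namespace="globus" key="maxtime">100000</profile>

# tc.data app entry carries maxwalltime=01:00:00
```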
The logs are on CI net at /home/wilde/mp/mp04:
ftdock-20110218-2307-xfdlhkd5.{log,stdout}
Pretty much the same execution pattern occurred:
> Stage in completes rapidly and jobs are started:
Progress: uninitialized:3
Progress: Selecting site:2499 Initializing site shared directory:1
Progress: Selecting site:1100 Initializing site shared directory:1 Stage in:1399
Progress: Stage in:2360 Submitting:140
Progress: Stage in:1630 Submitting:869 Submitted:1
Progress: Stage in:1630 Submitting:820 Submitted:50
Progress: Stage in:1625 Submitting:442 Submitted:433
Progress: Stage in:1625 Submitting:79 Submitted:796
Progress: Stage in:1368 Submitting:12 Submitted:1120
Progress: Stage in:1037 Submitting:58 Submitted:1357 Active:48
Progress: Stage in:302 Submitting:269 Submitted:1812 Active:117
Progress: Submitted:2331 Active:169
Progress: Submitted:2259 Active:241
Progress: Submitted:2211 Active:289
> This time, we get all coaster slots filled pretty quickly:
Progress: Submitted:219 Active:2281
Progress: Submitted:147 Active:2353
Progress: Submitted:100 Active:2400
> Then jobs start finishing:
Progress: Submitted:100 Active:2399 Checking status:1 Finished successfully:2
Progress: Submitted:100 Active:2399 Finished successfully:7
Progress: Submitted:100 Active:2399 Checking status:1 Finished successfully:14
> Workers stay filled until about 800 jobs finish. Then worker utilization starts dropping off monotonically:
Progress: Submitted:95 Active:2399 Checking status:1 Finished successfully:747
Progress: Submitted:90 Active:2398 Finished successfully:760
Progress: Submitted:81 Active:2398 Checking status:1 Stage out:2 Finished successfully:773
Progress: Submitted:71 Active:2399 Checking status:1 Finished successfully:797
Progress: Submitted:68 Active:2398 Checking status:1 Stage out:1 Finished successfully:809
Progress: Submitted:63 Active:2397 Finished successfully:827
Progress: Submitted:64 Active:2392 Checking status:1 Finished successfully:852
Progress: Stage in:1 Submitted:64 Active:2385 Stage out:1 Finished successfully:869
Progress: Submitting:2 Submitted:59 Active:2379 Finished successfully:890
Progress: Submitted:62 Active:2372 Finished successfully:909
> The dropoff continues until I hit ^C on the run:
Progress: Stage in:1 Submitted:1174 Active:1024 Finished successfully:3591
Progress: Submitted:1174 Active:1024 Checking status:1 Finished successfully:3591
Progress: Submitted:1175 Active:1023 Checking status:1 Finished successfully:3593
Progress: Submitted:1175 Active:1023 Stage out:1 Finished successfully:3596
Progress: Submitted:1177 Active:1021 Checking status:1 Finished successfully:3601
Progress: Submitted:1178 Active:1020 Checking status:1 Finished successfully:3604
Shutting down worker
Shutting down worker
> Just before I stopped the run, I checked the number of running worker blocks in PBS a few times, and saw this:
login1$ qstat -u wilde | grep ' R ' | wc -l
99
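To break the count down by state instead of grepping for ' R ', something like this works. A sketch only: the state column position ($10) is assumed from the standard `qstat -u USER` output layout, so verify it against your PBS version:

```shell
# Count PBS jobs by state, reading `qstat -u USER` output on stdin.
# Assumes the state letter is field 10 of each job line (standard -u layout);
# header lines are skipped by requiring a single uppercase letter there.
count_states() {
  awk 'NF >= 11 && $10 ~ /^[A-Z]$/ { c[$10]++ }
       END { for (s in c) printf "%s %d\n", s, c[s] }'
}

# usage: qstat -u wilde | count_states
```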
I caught at least one job in a "C" state. It looks like 1 worker of 100 died, for separate reasons we can explore later. Or could that one worker's termination have triggered the worker-underutilization anomaly?
With a walltime of 100,000 secs (about 27 hours) I would have expected the workers to stay up, barring fatal node errors. Those are a possibility if that one worker died from OOM errors. I wonder if workers could report RAM-pressure stats back to the service? For now just as logging info; later as a scheduling criterion.
I will try now to run with multi-job coaster blocks. If it works, I'll try with one big block and see how the scheduler handles that config.
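For the multi-job-block attempt, the relevant knobs should be the coaster block-shape profiles in sites.xml. A sketch, assuming the usual key names (I'm writing these from memory, so treat them as assumptions to verify):

```
<!-- sites.xml, coaster pool: multi-node blocks instead of 1-node blocks -->
<profile namespace="globus" key="slots">10</profile>
<profile namespace="globus" key="maxNodes">10</profile>
<profile namespace="globus" key="nodeGranularity">10</profile>
<profile namespace="globus" key="workersPerNode">24</profile>
```

For the one-big-block test, presumably slots=1 with maxNodes and nodeGranularity both set to 100.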
- Mike
----- Original Message -----
> And sorry about that.
>
> r3053 should fix that.
>
> On Fri, 2011-02-18 at 20:01 -0800, Mihael Hategan wrote:
> > Thanks.
> >
> > On Fri, 2011-02-18 at 21:58 -0600, Michael Wilde wrote:
> > > It fails for 10- and 1-job runs as well.
> > >
> > > - Mike
> > >
> > >
> > > ----- Original Message -----
> > > > Just tried this on Beagle with a similar workload to the one that
> > > > shows the original problem. I got:
> > > >
> > > > Progress: Stage in:2486 Submitting:14
> > > > Progress: Stage in:1712 Submitting:787 Submitted:1
> > > > queuedsize > 0 but no job dequeued. Queued: {}
> > > > java.lang.Throwable
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:253)
> > > >
> > > > Logs are in:
> > > >
> > > > login1$ cat out.pdb.all.00
> > > > Swift svn swift-r4061 (swift modified locally) cog-r3052 (cog modified locally)
> > > >
> > > > Output on stdout/err is below.
> > > >
> > > > Thanks!
> > > >
> > > > Mike
> > > >
> > > > RunID: 20110218-2137-v87vupcc
> > > > Progress:
> > > > SwiftScript trace: 10gs-1
> > > > SwiftScript trace: 1a1u-1
> > > > SwiftScript trace: 1m3g-1
> > > > SwiftScript trace: 1a1x-1
> > > > SwiftScript trace: 1a1m-1
> > > > SwiftScript trace: 1a12-1
> > > > SwiftScript trace: 1m62-1
> > > > SwiftScript trace: 1a22-1
> > > > SwiftScript trace: 121p-1
> > > > SwiftScript trace: 1a4p-1
> > > > SwiftScript trace: 1m6b-1
> > > > SwiftScript trace: 1m7b-1
> > > > SwiftScript trace: 1m9i-1
> > > > SwiftScript trace: 1mi1-1
> > > > SwiftScript trace: 1m6b-2
> > > > SwiftScript trace: 1a22-2
> > > > SwiftScript trace: 1mfg-1
> > > > SwiftScript trace: 1m9j-1
> > > > SwiftScript trace: 1a1w-1
> > > > SwiftScript trace: 1mdi-1
> > > > SwiftScript trace: 1mq1-1
> > > > SwiftScript trace: 1mp1-1
> > > > SwiftScript trace: 1mq0-1
> > > > SwiftScript trace: 1mk3-1
> > > > SwiftScript trace: 1mj4-1
> > > > SwiftScript trace: 1mil-1
> > > > SwiftScript trace: 1mr1-1
> > > > SwiftScript trace: 1nbq-1
> > > > SwiftScript trace: 1mr8-1
> > > > SwiftScript trace: 1mr1-2
> > > > SwiftScript trace: 1n4m-2
> > > > SwiftScript trace: 1n83-1
> > > > SwiftScript trace: 1mm2-1
> > > > SwiftScript trace: 1nd7-1
> > > > SwiftScript trace: 1nm8-1
> > > > SwiftScript trace: 1n4m-3
> > > > SwiftScript trace: 1nfi-2
> > > > SwiftScript trace: 1nou-2
> > > > SwiftScript trace: 1nou-1
> > > > SwiftScript trace: 1nfi-1
> > > > SwiftScript trace: 1o5e-1
> > > > SwiftScript trace: 1o6u-2
> > > > SwiftScript trace: 1nty-1
> > > > SwiftScript trace: 1mx3-1
> > > > SwiftScript trace: 1n3u-2
> > > > SwiftScript trace: 1muz-1
> > > > SwiftScript trace: 1o86-1
> > > > SwiftScript trace: 1n3u-1
> > > > SwiftScript trace: 1oa8-1
> > > > SwiftScript trace: 1oc0-1
> > > > Progress: uninitialized:3
> > > > Progress: Initializing:1311 Selecting site:1189
> > > > Progress: Selecting site:2499 Initializing site shared directory:1
> > > > Progress: Selecting site:2340 Initializing site shared directory:1 Stage in:159
> > > > Progress: Stage in:2486 Submitting:14
> > > > Progress: Stage in:1712 Submitting:787 Submitted:1
> > > > queuedsize > 0 but no job dequeued. Queued: {}
> > > > java.lang.Throwable
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:253)
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:521)
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:109)
> > > > queuedsize > 0 but no job dequeued. Queued: {}
> > > > java.lang.Throwable
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:253)
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:521)
> > > > at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:109)
> > > >
> > > >
> > > > Logs are on CI net in /home/wilde/mp/mp04:
> > > > cp ftdock-20110218-2137-v87vupcc.log out.pdb.all.00 ~/mp/mp04/
> > > >
> > > > - Mike
> > > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > There was a bug in the block allocation scheme that would cause
> > > > > blocks to be kept, in the long run, at about half of what would
> > > > > normally be necessary. This included shutting down perfectly good
> > > > > blocks that could be used for jobs. The effect was more dramatic
> > > > > when the maximum block size was 1.
> > > > >
> > > > > I committed a fix for this in the stable branch (cog r3052). If
> > > > > you've experienced the above, you could give this a try. It would
> > > > > also be helpful if you gave it a try anyway, just to check if
> > > > > things are going ok.
> > > > >
> > > > > Mihael
> > > > >
> > > > > _______________________________________________
> > > > > Swift-devel mailing list
> > > > > Swift-devel at ci.uchicago.edu
> > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > >
> > > > --
> > > > Michael Wilde
> > > > Computation Institute, University of Chicago
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > >
> > >
> >
> >
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory