[Swift-devel] Swift and BGP plots

Mihael Hategan hategan at mcs.anl.gov
Tue Oct 27 00:35:47 CDT 2009


On Mon, 2009-10-26 at 23:19 -0500, Ioan Raicu wrote:
> 
> 
> Mihael Hategan wrote: 
> > On Mon, 2009-10-26 at 16:36 -0500, Ioan Raicu wrote:
> > [...]
[...]
> >   
> OK. So which plot/data should I be looking at to get the summary of
> the per-task performance?

None. You calculate the efficiency from the total time, the individual
task time, and the number of nodes.
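In symbols (my notation, a minimal sketch; none of these names come from
the report pages):

    # Speedup relative to one node running every task back to back,
    # normalized by the number of nodes.
    def efficiency(num_tasks, task_time, num_nodes, total_time):
        serial_time = num_tasks * task_time
        return (serial_time / total_time) / num_nodes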

> > > > 64k jobs, 4000 workers:
> > > > http://www.mcs.anl.gov/~hategan/report-dc-4000/
> > > >   
> > > >       
> > > Shortest event (s): 106.119999885559
> > > Longest event (s): 1246.60699987411
> > > Mean event duration (s): 334.987874176266
> > > Standard deviation of event duration (s): 290.212811366649
[...]
> In the case of the above 18%, I took 60 / 334 ~ 0.18 = 18%

See http://en.wikipedia.org/wiki/Speedup

Efficiency is speedup divided by number of cores. You cannot infer
things from the mean duration because it says nothing about the degree
of parallelism. I.e., you're not interested in how fast individual things
are going, but in how fast all things are going overall. That's why you
can have a crappy-CPU machine like the BGP be in the top 10. Allow me to
paint this:

Scenario 1: [-1-][-2-][-1-][-2-]
Scenario 2: [---1----][---2----]

(two ways a scheduler might run two tasks on one CPU).

Efficiency is the same in both cases. In scenario 1 the average task
duration, measured from start to finish, is 15 characters because the
two tasks are interleaved; in scenario 2 it's 10 characters. In both
cases the raw task duration (the actual work per task) is 10 characters.
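Plugging the picture into numbers (a sketch in Python; the "characters"
are just time units):

    # Same total work, same wall clock, same CPU count => same efficiency,
    # even though the mean start-to-finish span differs (15 vs. 10 characters).
    total_work = 2 * 10                       # two tasks, 10 characters of work each
    wall_clock = 20                           # both schedules finish after 20 characters
    cpus = 1
    print(total_work / (cpus * wall_clock))   # 1.0 in both scenarios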

[...]
> >   
> I see the block utilization near 100% all the time,

Right. That's because it measures the wrapper time, not the wrapped job
time (some time is spent doing whatever the wrapper does). That just
says that the job-to-worker dispatch algorithm in the coasters works ok
with that load.

>  so that doesn't seem to match the other data I saw.

They measure different things. But if you calculate the efficiency the
proper way, you'll see that they are closer.

> > 2. Multiply 60 s by the number of jobs (65535), divide by the number of
> > workers (6*1024) and then by the total time from when the first job starts
> > to when the last job finishes (or you could choose the middle of the
> > ramp-up to the middle of the ramp-down to get some sort of amortized
> > efficiency). That gives you about 91% end-to-end and 96% amortized. Or
> > you could divide by the total time, including Swift startup, partition
> > boot time, etc., to get 64%.
> >   
> 65535*60/(6*1024) ~ 640 sec. I see the end-to-end time being about
> 1300 sec, or 1100 sec if we look at just Karajan. The 64% efficiency
> is in the ballpark, but I don't see where the 91% and 96% are coming
> from. 

I think you're mixing the runs. Sorry I didn't make it clearer, but
dc-4000 is the 4*1024-core run and dc-6000 is the 6*1024-core run. So
you're dividing the 4*1024-core speedup by 6*1024, which gives you 2/3
of the efficiency. Multiply 64% by 3/2 to get the proper number back.
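A quick Python check of that 2/3 factor, using only numbers already in
the thread:

    # Ideal wall clock for 65535 jobs of ~60 s each:
    ideal_4096 = 65535 * 60 / (4 * 1024)    # ~960 s with the dc-4000 run's 4096 workers
    ideal_6144 = 65535 * 60 / (6 * 1024)    # ~640 s if you assume 6*1024 workers instead
    print(ideal_6144 / ideal_4096)          # 0.666... -- hence "multiply by 3/2"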




