[petsc-dev] use of hash table vs array in various places in PETSc

Aron Ahmadia aron.ahmadia at kaust.edu.sa
Mon Sep 19 15:13:55 CDT 2011


It has to put the arguments in registers to make a function call. A static
function call (address known at compile time) has essentially no overhead
(like 1 cycle) after the arguments are positioned. Sometimes the compiler
can avoid positioning the registers specially for the function call (e.g. it
just uses those registers for earlier computation). If that bit of code is
under register pressure, it will have to save and restore some registers. If
you inline the function, there will probably be some registers used anyway,
so it can be ambiguous, especially depending on the size of the function
(e.g. if one side of the "if" is large).

A static function tail-calling a function pointer with the same arguments
compiles to mov, mov, jmp. I.e.

int foo(Obj obj, int a, int b, int c) {
  return obj->ops->foo(obj, a, b, c);
}

The mov is cheap as long as the object is in cache (usually you're going to
do something with it anyway), but the indirect function call takes around 6
cycles due to a partial pipeline stall. You see a similar cost for a "bad"
branch of an if statement. (The pipeline usually predicts that backward
conditional jumps will be taken, but forward jumps will not. The compiler
has to make a decision when laying out the code, so there will always be a
bad branch. Virtual calls are symmetric and if you take the same branch many
times, all modern CPUs will predict it correctly regardless of which side is
taken, so the cost of the indirect call will be mostly hidden.)

Different CPU generations will have different performance if you have a
bunch of heterogeneous objects, and it doesn't always get better. Penryn
gets very good performance with a bunch of objects whose type is index
mod 3, but Sandy Bridge is about 5 times worse. This huge variation mostly
goes away if the objects are mostly homogeneous. In the PetscTable case,
there is just one object and the indirect call is always the same.

Note that if you make the indirect call, there is no advantage to putting
the implementations in a header (the compiler wasn't going to inline them
anyway).


Jed, thanks for the detailed discussion.  I'm sure I picked some of this
stuff up somewhere but I don't think it stuck.

On the BG/P, if I did the math right, you may spend on the order of 5-10
more cycles getting the arguments in place and calling the function with
virtual dispatch than with hard-coded dispatch. On a machine like BG/Q,
some of this will be amortized by symmetric multi-threading: the hardware
execution units will be oversubscribed by hardware threads, so the core
can switch contexts to hide the cost of the stall.

For comparison, an L1 cache hit costs on the order of 2 cycles to
retrieve a piece of data. An L1 cache miss costs ~50 cycles, and missing
the L3 cache costs ~100 cycles.

BG/P has in-order superscalar execution, so you *will* stall on that
cache miss: the compiler will not do you any favors in terms of
separating the load of the data from its use.

Cheers,
Aron
