[petsc-dev] use of hash table vs array in various places in PETSc

Jed Brown jedbrown at mcs.anl.gov
Mon Sep 19 14:39:30 CDT 2011


On Mon, Sep 19, 2011 at 17:31, Barry Smith <bsmith at mcs.anl.gov> wrote:

> Of course we would not have a string lookup, but it would be the
> object->function(...) type of virtual function we use; I don't see how any
> compiler would ever be able to inline this. It seems to me that this might
> trash the instruction cache, or perhaps not; if it doesn't trash the
> instruction cache then maybe it would not be any more expensive than the
> if () test? But it does have to push the arguments on the stack before the
> function call and clean them off afterwards (though presumably cleaning
> them off is basically free), or will it be smart enough to use registers
> for the arguments (but then would it back up the previous values in the
> registers, which takes time, or would it ..... or would it .....). Hmm, so
> maybe just using the two classes/virtual function approach is fine and I
> shouldn't do the uglier if () stuff??? My gut feeling (which is likely
> wrong) is that replacing an array access with a function call is always a
> bad idea :-(.


It has to put the arguments in registers to make a function call. A static
function call (address known at compile time) has essentially no overhead
(about 1 cycle) once the arguments are positioned. Sometimes the compiler
can avoid positioning the registers specially for the function call (e.g. it
already used those registers for earlier computation). If that bit of code
is under register pressure, it will have to save and restore some registers.
If you inline the function, some registers will probably be used anyway, so
whether inlining actually wins can be ambiguous, especially depending on the
size of the function (e.g. if one side of the "if" is large).
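
As a toy illustration (made-up names, nothing PETSc-specific): with
optimization on, a compiler will usually either inline a small static
helper like this outright or emit a single direct call, often with the
arguments already sitting in the right registers.

static int axpy1(int a, int x, int y) { return a*x + y; }

int caller(int a, int x, int y) {
  /* Direct call: the address is known at compile time, so once the
     arguments are positioned the call itself costs on the order of a
     cycle; often the compiler just inlines the body and there is no
     call at all. */
  return axpy1(a, x, y);
}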

A static function tail-calling a function pointer with the same arguments
compiles to mov, mov, jmp. I.e.

int foo(Obj *obj, int a, int b, int c) {  /* schematic: Obj carries an ops table of function pointers */
  return obj->ops->foo(obj, a, b, c);     /* tail call through the function pointer */
}

The movs are cheap as long as the object is in cache (usually you're going
to do something with it anyway), but the indirect function call takes around
6 cycles due to a partial pipeline stall. You see a similar cost for a "bad"
branch of an if statement. (The pipeline usually predicts that backward
conditional jumps will be taken and forward jumps will not. The compiler has
to make a choice when laying out the code, so one side is always the bad
branch. Virtual calls are symmetric: if you call through the same pointer
many times, all modern CPUs will predict the target correctly regardless of
which implementation it points to, so the cost of the indirect call will be
mostly hidden.)
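
For concreteness, here is a minimal sketch of the ops-table dispatch pattern
being discussed (Table, TableFind, ArrayFind, HashFind are names made up for
illustration, not actual PETSc code): the public entry point tail-calls
through the function pointer, and the two "classes" differ only in which
implementation is installed.

typedef struct _Table    Table;
typedef struct _TableOps TableOps;
struct _TableOps { int (*find)(Table *t, int key, int *val); };
struct _Table    { TableOps *ops; void *data; };

/* Public entry point: load t->ops, load ops->find, jump (the mov, mov, jmp
   above). */
int TableFind(Table *t, int key, int *val) {
  return t->ops->find(t, key, val);
}

/* "Class" 1: a dense array indexed directly by the key. */
static int ArrayFind(Table *t, int key, int *val) {
  *val = ((int *)t->data)[key];
  return 0;
}

/* "Class" 2: stand-in for a hashed lookup (body is just a placeholder). */
static int HashFind(Table *t, int key, int *val) {
  (void)t; *val = key;
  return 0;
}

static TableOps ArrayOps = {ArrayFind};
static TableOps HashOps  = {HashFind};

/* Installing one or the other at creation time selects the behavior. */
void TableSetArray(Table *t, int *storage) { t->ops = &ArrayOps; t->data = storage; }
void TableSetHash (Table *t, void *state)  { t->ops = &HashOps;  t->data = state; }

Whichever ops table a given object carries, the call site inside TableFind
is the same one, which is what matters for the prediction behavior below.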

Different CPU generations give different performance when you have a bunch
of heterogeneous objects, and it doesn't always get better with newer
hardware. Penryn gets very good performance with a bunch of objects whose
type is index mod 3, but Sandy Bridge is about 5 times worse on that
pattern. This huge variation mostly goes away if the objects are mostly
homogeneous. In the PetscTable case, there is just one object and the
indirect call target is always the same.
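
A toy way to see the effect (again hypothetical code, not a real benchmark
from this thread): drive one indirect call site with a target that either
cycles through three implementations or stays fixed, and compare timings.

#include <stddef.h>

static int impl0(int x) { return x + 0; }
static int impl1(int x) { return x + 1; }
static int impl2(int x) { return x + 2; }

/* Single indirect call site; the target repeats with period ntargets. */
int sum_indirect(int (*const table[])(int), size_t n, size_t ntargets) {
  int s = 0;
  for (size_t i = 0; i < n; i++)
    s += table[i % ntargets](s);
  return s;
}

int demo(size_t n) {
  int (*const table[3])(int) = {impl0, impl1, impl2};
  /* ntargets = 3: the target is index mod 3, which stresses the
     branch-target predictor on some CPUs; ntargets = 1: the same target
     every time, which any modern CPU predicts essentially perfectly. */
  return sum_indirect(table, n, 3) + sum_indirect(table, n, 1);
}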

Note that if you make the indirect call, there is no advantage to putting
the implementations in a header (the compiler wasn't going to inline them
anyway).