<div dir="ltr"><div><div>It has to put the arguments in registers to make a function call. A static function call (address known at compile time) has essentially no overhead (like 1 cycle) after the arguments are positioned. Sometimes the compiler can avoid positioning the registers specially for the function call (e.g. it just uses those registers for earlier computation). If that bit of code is under register pressure, it will have to save and restore some registers. If you inline the function, there will probably be some registers used anyway, so it can be ambiguous, especially depending on the size of the function (e.g. if one side of the "if" is large).</div>

<div><br></div><div>A static function tail-calling a function pointer with the same arguments compiles to mov, mov, jmp. I.e.</div><div><br></div><div>int foo(obj,a,b,c) {</div><div>  return obj->ops->foo(obj,a,b,c);</div>

<div>}</div><div><br></div><div>The mov is cheap as long as the object is in cache (usually you're going to do something with it anyway), but the indirect function call takes around 6 cycles due to a partial pipeline stall. You see a similar cost for a "bad" branch of an if statement. (The pipeline usually predicts that backward conditional jumps will be taken, but forward jumps will not. The compiler has to make a decision when laying out the code, so there will always be a bad branch. Virtual calls are symmetric and if you take the same branch many times, all modern CPUs will predict it correctly regardless of which side is taken, so the cost of the indirect call will be mostly hidden.)</div>

<div><br></div><div>Different CPU generations will have different performance if you have a bunch of heterogeneous objects and it doesn't always get better. Penryn would get very good performance with a bunch of objects where type was index mod 3, but Sandy Bridge is about 5 times worse. This huge variation mostly goes away if the objects are mostly homogeneous. In the PetscTable case, it is just one object  and the indirect call is always the same.</div>

<div><br></div><div>Note that if you make the indirect call, there is no advantage to putting the implementations in a header (the compiler wasn't going to inline them anyway).</div></div><div><br></div><div><br></div>

<div>Jed, thanks for the detailed discussion.  I'm sure I picked some of this stuff up somewhere but I don't think it stuck.</div><div><br></div><div>On the BG/P, if I did the math right you may spend on the order of 5-10 more cycles getting the arguments in place and calling the function with virtual dispatch then a hard-coded dispatch.  In a machine like BG/Q some of this will be amortized by the symmetric multi-threading: the hardware execution units will be oversubscribed by the hardware threads and it will be able to switch contexts to amortize the cost of the stall.</div>

<div><br></div><div>In comparison, an L1 cache hit costs on the order of 2 cycles to retrieve a piece of data.</div><div><br></div><div>An L1 cache miss will cost on the order of ~50 cycles.<br>Miss the L3 cache and the penalty is ~100 cycles.</div>

<div><br></div><div>BG/P has in-order superscalar execution, you *will* stall on that cache miss because the compiler will not do you any favors in terms of separating the load of the data from its use.</div><div><br></div>

<div>Cheers,</div><div>Aron<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div><br></div></blockquote></div></div></div>