Aleksey Shipilёv, @shipilev, aleksey@shipilev.net
This post is very long, but please try to read it without skipping. The text naturally builds on the observations made earlier in the same text. That's also why the first part is rather boring and slow. Buckle up, and read. If you are running out of steam, take a break, and pick up where you left off. Please read it in its entirety before asking questions! |
There is a much shorter "Too fast, too megamorphic" post from Richard Warburton (@RichardWarburto), if you need spoilers and warmup. Vyacheslav Egorov (@mraleph) had a high-level post about the similar stuff in "What’s up with monomorphism?". |
Preface
Programming languages like Java provide the facilities for subtyping/polymorphism as one of the ways to construct modular and reusable software. This language choice naturally comes at a price, since there is no hardware support for virtual calls, and therefore runtimes have to emulate this behavior. In many, many cases the performance of method dispatch is not important. Actually, in a vast majority of cases, the low-level performance concerns are not the real concerns.
However, there are cases when method dispatch performance is important, and there you need to understand how dispatch
works, what runtimes optimize for you, and what you can do to cheat and/or emulate similar behavior in your code.
For example, in the course of String Compression work, we were faced
with the problem of selecting the coder for a given String. The obvious and highly maintainable approach of creating
a Coder
interface, a few implementations, and dispatching the virtual calls over it, had met some performance problems
on the very tiny benchmarks. Therefore, we needed to contemplate something better. After a few experiments, this post
was born as a reference for others who might try to do the same. This post also tangentially touches on the inlining of virtual calls, as a natural part of the optimization.
As a good tradition, we will take some diversions into benchmarking methodology and general low-level performance engineering, so that even though the post itself is targeted at platform people, the general crowd can still learn a few tricks. As usual, if you still haven’t learned about JMH and/or haven’t looked through the JMH samples, then I suggest you do that first before reading the rest of this post for the best experience.
This post also assumes a good understanding of Java, Java bytecode, compilers, runtimes, and x86 assembly. Many readers complain my posts lack completeness in the sense that I omit the details of how to arrive at a particular conclusion. This time I decided to use lots of notes to highlight how one can decipher the compiler decisions, how to read the generated assembly, etc. It should be fun and educational, but you may also ignore the parts formatted like callouts, if you don’t have time. |
1. Problem
We can formalize the problem as follows. Suppose we have a class Data
:
public class Data {
    byte[] data;
}
…and we want to do different things based on that data. Suppose we have N versions of code, each of those versions
provides some sort of meaning for the data
. In String Compression, for example, data
can be either 1-byte-encoded
array, or 2-byte-encoded array. Something somewhere should provide a meaning for that data; in other words, demangle
it and/or do useful work. Suppose, for example, we have a "coder" abstraction that does something along the lines of:
public class Coder1 {
    int work(byte[] data) { return omNomNom(data); }
}

public class Coder2 {
    int work(byte[] data) { return cIsForCookie(data); }
}
The question is, what is the best way to implement different coders and dispatch over them?
2. Experimental Setup
2.1. Hardware
While the exact code generation details surely differ among the platforms, in hindsight, most of the behaviors we are about to show are platform-agnostic, and done in high-level optimizers. Because of that, we simplify our lives and run the tests on a single 1x4x2 i7-4790K 4.0 GHz machine, Linux x86_64, JDK 8u40 EA. Suspicious readers are welcome to reproduce the results on their respective platforms.
2.2. Cases To Try
Granted, you may just inline the coder implementations straight into Data
, but that’s a questionable practice, especially if you want to have
multiple coder implementations. Therefore, we consider these ways to implement coders:
- Static: Make static implementations of Coders, all with static methods.
- Dynamic_Interface: Make Coder a proper interface, and provide the implementations.
- Dynamic_Abstract: Same as above, but make Coder the abstract superclass, and provide the implementations.
All these three methods of implementing the behavior need some way to encode it in the Data
itself. We can come up
with four schemes of such an encoding:
- ID_Switch: Store a byte ID, and select the coder by switching over it.
- ID_IfElse: Store a byte ID, and select the coder by if-else-ing over it.
- Bool_IfElse: Store a boolean, and select the coder by if-else-ing over it. (Works only for N=2)
- Ref: Store a Coder reference, and do a virtual call.
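As a rough sketch of what these selectors mean for the data class (the field names here are illustrative; the actual benchmark classes appear later), the stored state would look along these lines:

public class Data {
    byte[] data;

    byte id;          // ID_Switch / ID_IfElse: a small numeric tag identifying the coder
    boolean isCoder0; // Bool_IfElse: a boolean flag, enough to distinguish only two coders
    Coder coder;      // Ref: a reference to the coder itself, invoked via a virtual/interface call
}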
There are also other ways to encode, e.g. storing the |
The choice of the encoding scheme also has to take the footprint into the consideration. Additional field in the |
Combining these behaviors and selection schemes, we have quite a few variants already. Thankfully, not all of them are required for a concrete N. However, it would be interesting to dissect each of the relevant ones to understand how runtimes deal with code like this.
2.3. Benchmarks
The source code for the benchmarks is available here. Since we are dealing with a very narrow VM effect, let me explain a few things about the benchmark. This is what a single benchmark case looks like:
public class Test {

    @Param("100")
    private int count;

    private Data[] datas;

    @Setup
    public void setup() {
        datas = new Data[count];
        Random r = new Random();
        for (int c = 0; c < count; c++) {
            byte[] contents = new byte[10];
            r.nextBytes(contents);
            datas[c] = new Data((byte) r.nextInt(2), contents);
        }
    }

    @Benchmark
    public void dynamic_Interface_Ref() {
        Data[] l = datas;
        int c = count;
        for (int i = 0; i < c; i++) {
            l[i].do_Dynamic_Interface_Ref();
        }
    }
}

public class Coder0 implements Coder {
    public int work(byte[] data) {
        return data.length; // something light-weight
    }
}

public class Coder1 implements Coder {
    public int work(byte[] data) {
        return data.length; // something light-weight
    }
}

public class Data {
    private static final Coder0 coder0 = new Coder0();
    private static final Coder1 coder1 = new Coder1();

    private final Coder coder;
    private final byte id;
    private final byte[] data;

    public Data(byte id, byte[] data) {
        this.id = id;
        this.data = data;
        this.coder = interface_ID_Switch();
    }

    private Coder interface_ID_Switch() {
        switch (id) {
            case 0: return coder0;
            case 1: return coder1;
            default:
                throw new IllegalStateException();
        }
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public int do_Dynamic_Interface_Ref() {
        return coder.work(data);
    }
}
A few tricks worth mentioning here:
- We do a loop over a few Data instances. That will be important when we start to look into dealing with different coders: the runtime should be "poisoned" with the different types of coders at the call site. In the example above, the test has two types of coders, with ID=0 and ID=1.
- We hand-optimize the loop in the @Benchmark method, because its performance is important when we deal with the interpreter.
- The benchmark loop does not consume the payload result. This is done to make the nanobenchmark amplify the costs of the payload itself, not the infrastructure costs. This practice is DANGEROUS and ignores the JMH recommendations. This trick can only be done by trained professionals, who would then verify it does not obliterate the benchmark, due to e.g. dead code elimination. (A safer, conventional variant is sketched right after this list.)
- The payload method (do_Dynamic_Interface_Ref in this example) is annotated with a special JMH hint that prevents inlining of the payload into the benchmark loop. This is an important prerequisite for avoiding dead code elimination, at least on HotSpot today. It also helps to separate the payload body from the remaining code in the assembly dump.
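For reference, the conventional JMH-recommended shape of such a loop consumes every result, for example via a Blackhole (org.openjdk.jmh.infra.Blackhole). A minimal sketch of that safer variant, for comparison (the method name is made up for illustration):

@Benchmark
public void dynamic_Interface_Ref_consumed(Blackhole bh) {
    Data[] l = datas;
    int c = count;
    for (int i = 0; i < c; i++) {
        bh.consume(l[i].do_Dynamic_Interface_Ref()); // result is consumed, so dead code elimination is off the table
    }
}

The benchmarks in this post deliberately skip this, as explained above, after verifying the payload does not get eliminated.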
2.4. VM Modes
By default, HotSpot VM these days provides the tiered compilation: first, the code is interpreted, then it’s compiled with a baseline compiler (also known as "client" compiler, or C1), and finally it is compiled with an aggressively optimizing compiler (also known as "server" compiler, or C2).
The actual pipeline is more complicated. VM constants for tiers of compilation shine some light on what each level does. level=1 is pure C1, level=4 is pure C2. Levels in-between are C1 with profiling. The transitions between levels are governed by an advanced policy, and you can talk about them forever. |
While most people care about the performance in C2, since the code that gets aggressively optimized is by construction the same code that consumes lots of time, we also care about the performance in pathological modes where the code hasn’t yet made it to C2 compilation.
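The runs below simply pass the relevant -XX flags for the whole JVM. If you wanted to restrict a mode to specific benchmarks instead, one possible way (a sketch, assuming the same loop as above; not what this post does) is JMH's per-fork JVM arguments:

@Benchmark
@Fork(jvmArgsAppend = {"-XX:TieredStopAtLevel=1"}) // pin this particular benchmark to C1 only
public void dynamic_Interface_Ref_C1() {
    Data[] l = datas;
    int c = count;
    for (int i = 0; i < c; i++) {
        l[i].do_Dynamic_Interface_Ref();
    }
}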
3. Single Type
If you are inexperienced with walls of assembly, it is a good idea to pay attention to this part. |
Let us start with a scenario with only a single coder. While this scenario is superficial, because you might as well inline the coder implementation into the data class itself, it still serves as an interesting basic case that we want to untangle before diving into more complicated cases. Here, we only concern ourselves with the Ref style of encoding, since other styles have almost no meaning here.
The JMH benchmarks consider a single Coder0
that extends AbstractCoder, and that’s
the only subclass of this abstract class. Coder0
also implements Coder
interface, and there is only a single interface implementer. Even though we have a singular Coder0
that implements/extends
both the abstract class and interface, we try to call for its work either via abstract class (invokevirtual), interface
(invokeinterface), or the static method (invokestatic).
public abstract class AbstractCoder {
    public abstract int abstractWork(byte[] data);
}

public interface Coder {
    int work(byte[] data);
}

public class Coder0 extends AbstractCoder implements Coder {

    public static int staticWork(byte[] data) {
        return data.length;
    }

    @Override
    public int work(byte[] data) {
        return data.length;
    }

    @Override
    public int abstractWork(byte[] data) {
        return data.length;
    }
}
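The payload methods in Data then exercise the three call flavors, roughly along these lines (a sketch mirroring the do_Dynamic_Interface_Ref example from the benchmark section; the field names match the assembly comments below, see the benchmark sources for the exact code):

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Static_Ref() {
    return Coder0.staticWork(data);          // invokestatic
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Dynamic_Abstract_Ref() {
    return abstractCoder.abstractWork(data); // invokevirtual
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Dynamic_Interface_Ref() {
    return coder.work(data);                 // invokeinterface
}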
3.1. C2
Without further ado, the results are:
Benchmark (count) Mode Cnt Score Error Units
One.dynamic_Abstract_Ref 10000 avgt 50 30.168 ± 0.602 us/op
One.dynamic_Interface_Ref 10000 avgt 50 31.391 ± 0.067 us/op
One.static_Ref 10000 avgt 50 29.171 ± 0.385 us/op
We can immediately spot the difference: the static case is faster than the dynamic ones, and the interface case is slower than the abstract class one. While the difference seems minuscule, in some cases, like ours, it might matter a great deal. Let's figure out why the performance is different. To do that, we employ JMH's -prof perfasm profiler, which maps the Linux perf event sampling onto the disassembly.
There are other ways to get the low-level profiling and disassembly, including but not limited to
dealing with VM
yourself,
using Solaris Studio Performance Analyzer,
or JITWatch and others. Every approach has its advantages and its limits, learn
how to employ the right tool for a particular case. JMH |
Editorial note. The assembly listings you will see in this post are slightly different from what you will see in the actual
assembly dump from the VM. It would be much larger, but thanks to |
3.1.1. C2: static_Ref
[Verified Entry Point]
0.02% 0.02% mov %eax,-0x14000(%rsp)
9.18% 6.63% push %rbp
0.01% sub $0x10,%rsp
mov 0x10(%rsi),%r11d ; get field $data
9.54% 7.90% mov 0xc(%r12,%r11,8),%eax ; Coder0.staticWork() body
10.79% 10.78% add $0x10,%rsp ; epilogue and return
pop %rbp
test %eax,0x15f7c4e0(%rip)
15.56% 21.52% retq
This method yields simple and straightforward code, and that is not surprising: we just call the static method. Static methods are also statically bound, which means we don't need to resolve what method we should call at runtime. We know exactly which method we are about to call, statically, at compile time. The same thing applies to our own private methods: they are also statically resolvable.
How did we figure a thing about |
Another interesting question is about null checks. Java spec requires to check the |
The only remaining part, other than |
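As a tiny illustration of what "statically resolvable" means (my own example, not from the benchmark), both call targets below are known at compile time, so no runtime lookup over the receiver type is needed:

public class StaticallyBound {
    private int privateWork(byte[] data) { return data.length; } // private: the target is known statically
    static int staticWork(byte[] data)   { return data.length; } // static: the target is known statically

    int callBoth(byte[] data) {
        return privateWork(data) + staticWork(data); // both calls resolve without inspecting any receiver type
    }
}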
3.1.2. C2: dynamic_Abstract_Ref
[Verified Entry Point]
0.44% 0.65% mov %eax,-0x14000(%rsp)
7.93% 6.68% push %rbp
0.23% 0.24% sub $0x10,%rsp
1.46% 0.54% mov 0x10(%rsi),%r11d ; get field $data
7.96% 7.25% cmp 0x18(%rsi),%r12d ; null checking $abstractCoder
je NULL_CHECK ; $abstractCoder is null, bail!
2.05% 1.94% mov 0xc(%r12,%r11,8),%eax ; Coder0.abstractWork() body, arraylength
13.19% 14.35% add $0x10,%rsp ; epilogue and return
0.33% 0.28% pop %rbp
3.08% 3.91% test %eax,0x1642359a(%rip)
11.75% 15.78% retq
NULL_CHECK:
<omitted>
In human words, it loads the data
field, it does the null-check for abstractCoder
, because we are about
to call a virtual method on it, and then we have the inlined body of Coder0.abstractWork()
. The cost of the null check
is what we see as the difference against a plain static
call.
How did we figure out the thing about null checking? Well, we already know |
One can wonder: what happens if we pulled a different coder from the field and not |
Another puzzling thing about this: Java supports dynamically loaded classes! What happens if we compile the current
code speculatively, but then some bastards receive another subclass of |
3.1.3. C2: dynamic_Interface_Ref
[Verified Entry Point]
0.03% 0.02% mov %eax,-0x14000(%rsp)
8.87% 7.69% push %rbp
0.06% 0.03% sub $0x20,%rsp
0.01% 0.02% mov 0x10(%rsi),%r10d ; get field $data
9.20% 10.68% mov 0x14(%rsi),%r8d ; get field $coder
0.01% 0.02% mov 0x8(%r12,%r8,8),%r11d ; get the class word for $coder
14.04% 18.00% cmp $0xf80102f6,%r11d ; compare it with Coder0
jne TYPE_CHECK_FAILED ; not our coder, bail!
7.50% 10.19% mov 0xc(%r12,%r10,8),%eax ; Coder0.work() body
1.74% 1.89% add $0x20,%rsp ; epilogue and return
pop %rbp
0.04% 0.04% test %eax,0x15fdfb4e(%rip)
10.31% 13.05% retq
TYPE_CHECK_FAILED:
<omitted>
Notice here that the CHA (Class Hierarchy Analysis) is not playing on our side. Granted, we could have done the same analysis for interfaces, figured out that the interface has only a single implementer, etc., but the reality is that real interfaces usually have lots of implementers, and so implementing and supporting this in the VM does not pass a reasonable cost-benefit analysis.
Therefore, the compiled code has to check the coder type, and this explains the performance difference against the simple static case. It is also more expensive than null-checking the coder instance, since we need to pull and compare the actual classword.
How did we figure out the classword is being loaded? Remember the note above, we know that we are reading something
at offset |
VM does the type checks by comparing the class word with a pre-compiled constant. That constant is actually the canonical address of native VM structure that mirrors the class. Since it’s canonical, and also because VM takes care of not moving it, we can compile the needed address right into the generated code, reducing these checks to just pointer comparison. |
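Put differently, the compiled do_Dynamic_Interface_Ref above behaves roughly like this hand-written Java (an illustrative rendition, not actual generated code; the last line stands in for the omitted TYPE_CHECK_FAILED path):

public int do_Dynamic_Interface_Ref() {
    Coder c = coder;
    if (c.getClass() == Coder0.class) { // the classword comparison against the precompiled Coder0 constant
        return data.length;             // inlined Coder0.work() body
    }
    return c.work(data);                // type check failed: bail out to the generic path
}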
3.2. C1
As noted above, the generated code passes through a few tiers of compilation before it reaches the heavily-optimized version. Sometimes what you transiently have before reaching the optimal performance also matters; this is why we should also look into C1, telling the VM to stop the tiered pipeline at level 1, with -XX:TieredStopAtLevel=1.
Ideally, we would want to limit the tiered compilation level only for a specific benchmark method, but it is simpler to run the entire code at the lower level. This is valid since we don't compare the performance of C1 vs. C2, but merely study how a concrete compiler behaves. |
Benchmark (count) Mode Cnt Score Error Units
One.dynamic_Abstract_Ref 10000 avgt 50 41.595 ± 0.049 us/op
One.dynamic_Interface_Ref 10000 avgt 50 36.562 ± 0.058 us/op
One.static_Ref 10000 avgt 50 26.594 ± 0.051 us/op
We can see a slightly different distribution as opposed to the C2 case: the static case is still significantly faster, but now the abstract case is significantly slower than the interface one. What gives?
3.2.1. C1: static_Ref
[Verified Entry Point]
8.79% 8.26% mov %eax,-0x14000(%rsp)
2.14% 1.57% push %rbp
0.28% 0.29% sub $0x30,%rsp
9.23% 7.44% mov 0x10(%rsi),%eax ; get field $data
0.77% 0.63% shl $0x3,%rax ; unpack it
0.35% 0.37% mov 0xc(%rax),%eax ; Coder0.work() body, arraylength
13.30% 10.77% add $0x30,%rsp ; epilogue and return
1.16% 0.61% pop %rbp
0.53% 0.32% test %eax,0x1650807f(%rip)
13.50% 13.20% retq
These cases are again remarkably simple, because we know the call target exactly, as we did in the same C2 case.
The compressed references packing/unpacking code may look like this as well. Notice it is almost the same as in
|
The minute differences in code generation between compilers are not a surprise. What's surprising is that the C2-ish instruction choice is marginally slower than the C1-ish instruction choice:

  C2: 29.575 ±(99.9%) 1.116 us/op
  C1: 26.347 ±(99.9%) 0.178 us/op

Following up on this difference will be interesting, but outside the scope of this work. This observation is another example of the importance of very careful comparisons for the code coming through different toolchains. Read "Java vs. Scala: Divided We Fail" for a funnier example of the same effect. |
3.2.2. C1: dynamic_Abstract_Ref
dynamic_Abstract_Ref():
[Verified Entry Point]
5.22% 3.71% mov %eax,-0x14000(%rsp)
1.12% 0.49% push %rbp
4.73% 4.15% sub $0x30,%rsp
0.20% 0.15% mov 0x18(%rsi),%edi ; get field $abstractCoder
0.73% 0.03% shl $0x3,%rdi ; unpack it
5.44% 4.78% mov 0x10(%rsi),%edx ; get field $data
0.02% shl $0x3,%rdx ; unpack it
0.16% 0.13% mov %rdi,%rsi
0.72% 0.04% mov $0xffffffffffffffff,%rax
4.89% 5.88% callq 0x00007f7ae1046220 ; VIRTUAL CALL to abstractWork()
6.01% 6.08% add $0x30,%rsp
0.45% 0.40% pop %rbp
0.28% 0.11% test %eax,0x17737229(%rip)
4.91% 6.09% retq
abstractWork():
[Verified Entry Point]
0.39% 0.18% mov %eax,-0x14000(%rsp)
5.39% 5.52% push %rbp
0.20% 0.13% sub $0x30,%rsp
0.53% 0.55% mov 0xc(%rdx),%eax ; arraylength
6.10% 5.79% add $0x30,%rsp ; epilogue and return
0.18% 0.09% pop %rbp
0.35% 0.46% test %eax,0x17736ee6(%rip)
0.45% 0.32% retq
First of all, you have to notice the abstractWork
method is not inlined. This is because C1 is not able to
figure out the static target for the call: the call target is not trivial, and the CHA is not working for abstract
classes here.
One can untangle that as follows, run with

  net.shipilev.one.One::do_Dynamic_Abstract_Ref (12 bytes)
    @ 8 net.shipilev.one.AbstractCoder::abstractWork (0 bytes) no static binding

Now, if you search the HotSpot source for "no static binding", you will find a relevant part of C1 GraphBuilder, which implies the inlining of this method
will happen if it’s static/private/dynamic, or otherwise final. Placing the |
Somebody asked me what’s that |
3.2.3. C1: dynamic_Interface_Ref
[Verified Entry Point]
7.63% 5.23% mov %eax,-0x14000(%rsp)
0.09% 0.22% push %rbp
7.13% 10.31% sub $0x30,%rsp
0.35% 0.10% mov 0x14(%rsi),%eax ; get field $coder
0.05% 0.16% shl $0x3,%rax ; unpack it
6.84% 10.14% mov 0x10(%rsi),%esi ; get field $data
0.16% 0.41% shl $0x3,%rsi ; unpack it
0.13% 0.11% cmp $0x0,%rax ; null-check
je WORK ; bail to implicit null-check handling
0.09% 0.06% mov $0x7c00817b0,%rcx ; load Coder0 metadata
6.50% 7.65% mov 0x8(%rax),%edx ; load classword for $coder
0.31% 0.38% shl $0x3,%rdx ; unpack it
0.49% 0.61% cmp 0x38(%rdx),%rcx ; check if $coder is actually Coder0
jne TYPE_CHECK_FAILED ; not? bail.
27.33% 30.35% jmpq WORK ; GO HOME COMPILER, YOU ARE DRUNK
WORK:
0.31% 0.25% mov %rax,%rdi
0.07% 0.04% cmp (%rax),%rax ; implicit null check
0.05% mov 0xc(%rsi),%eax ; Coder0.work() body, arraylength
7.03% 7.82% add $0x30,%rsp ; epilog and return
0.19% 0.30% pop %rbp
0.08% 0.06% test %eax,0x1639b908(%rip)
0.15% 0.08% retq
TYPE_CHECK_FAILED:
<omitted>
The interface call is inlined more or less nicely, but there is a significant dance for type checking. The largest part of this method seems to wait on cmp 0x38(%rdx),%rcx and the associated jumps, although it's hard to tell what the problem is exactly: it may be a cache miss on getting 0x38(%rdx) (though unlikely in a nano-benchmark), or some weird pipeline effect with two jumps following each other. Either way, the jmpq there seems suspicious, since we might as well fall through. This could be fixed with another pass of a peephole optimizer that could at least nop out the jmpq, but remember we are talking about C1, whose goal is to produce the code fast, even at the expense of its quality.
How did we figure |
3.3. Interpreter
The interpreter tests are interesting because you don’t expect an optimizing compiler to come and save you. That’s
certainly true, and if you run the tests with -Xint
, you will see something like this:
Benchmark (count) Mode Cnt Score Error Units
One.dynamic_Abstract_Ref 10000 avgt 50 1149.768 ± 7.131 us/op
One.dynamic_Interface_Ref 10000 avgt 50 1154.198 ± 11.030 us/op
One.static_Ref 10000 avgt 50 1137.694 ± 7.405 us/op
It would seem that the interpreter is so slow that the cost of the different calls themselves is drowning there. However, if you profile the interpreter with something like JMH perfasm, you will see that we are spending most of the time doing stack banging. Working around that with -Xint -XX:StackShadowPages=1 gives a nice performance boost:
Benchmark (count) Mode Cnt Score Error Units
One.dynamic_Abstract_Ref 10000 avgt 50 369.172 ± 2.024 us/op
One.dynamic_Interface_Ref 10000 avgt 50 388.112 ± 3.103 us/op
One.static_Ref 10000 avgt 50 344.355 ± 3.402 us/op
Meddling with |
As in the C2 case, the static case is faster, and the interface case is slower than the abstract one. Let us look into… the generated code! We won't spend much time here, but just outline a few key things.
As unusual as it sounds, our interpreter generates code. For each bytecode instruction it generates one or more platform-specific stubs that together implement the abstract bytecode-executing machine. With Futamura projections, one can easily argue this is actually a template compiler. The established term for this implementation is "template interpreter" though, because it does interpret the bytecode in a (weird, but) straight-forward way. The debate about what to call such an implementation further underlines that the line separating an interpreter from a compiler is very thin, if it exists at all. |
The assembly for the stubs is "fun" reading, and we will not go there. Instead, we will do a differential analysis, since the stubs are roughly the same for all the cases, and only their time distributions differ.
This is static_Ref:
....[Hottest Methods (after inlining)]................................
 15.33%  21.46%  <stub: invoke return entry points>
 13.89%   8.23%  <stub: method entry point (kind = zerolocals)>
 12.77%  14.76%  <stub: invokevirtual 182 invokevirtual>
 10.07%   9.69%  <stub: fast_aaccess_0 221 fast_aaccess_0>
  8.42%  13.37%  <stub: ireturn 172 ireturn>
  7.10%   5.56%  <stub: throw exception entrypoints>
  5.02%   2.63%  <stub: pop 87 pop>
  4.14%   2.90%  <stub: iload_3 29 iload_3>
  3.95%   6.49%  <stub: invokestatic 184 invokestatic>
  3.61%   1.08%  <stub: if_icmpge 162 if_icmpge>
  2.39%   2.36%  <kernel>
We have five large stubs of interest:
- fast_aaccess_0, for reading the array elements; remember our infrastructure code pulls the targets off an array.
- invokevirtual, for invoking the virtual payload calls; with no inlining we still have to make a call.
- method entry point, executed before entering the payload method body.
- invoke return, executed before leaving the payload method body.
- invokestatic, executing our payload; this stub is not very hot, because we spend most of the time dealing with larger problems in the infrastructure code itself.
For comparison, this is dynamic_Abstract_Ref:
....[Hottest Methods (after inlining)]................................
 18.98%  16.18%  <stub: invokevirtual 182 invokevirtual>
 16.82%   8.74%  <stub: method entry point (kind = zerolocals)>
 12.54%  20.40%  <stub: throw exception entrypoints>
 12.04%  10.56%  <stub: fast_aaccess_0 221 fast_aaccess_0>
  7.46%  12.45%  <stub: invoke return entry points>
  6.44%   6.24%  <stub: ireturn 172 ireturn>
  4.36%   6.94%  <stub: pop 87 pop>
  3.76%   2.69%  <stub: iload_3 29 iload_3>
  3.03%   0.60%  <stub: if_icmpge 162 if_icmpge>
  2.32%   2.27%  <kernel>
It is different from the previous one in the sense that invokestatic
is gone, and replaced by invokevirtual
for
our abstract call, and we also have the throw exception
stub, which is here to handle possible NPEs. This difference
explains why static_Ref is faster — for the same reason it is faster in compiled versions. If the call receiver is
known, everything gets easier.
dynamic_Interface_Ref is roughly the same, but with the invokeinterface stub:
....[Hottest Methods (after inlining)]................................
 17.35%  18.01%  <stub: invokeinterface 185 invokeinterface>
 12.93%   8.52%  <stub: method entry point (kind = zerolocals)>
 11.66%  11.30%  <stub: fast_aaccess_0 221 fast_aaccess_0>
 11.62%   9.44%  <stub: throw exception entrypoints>
  9.87%  11.01%  <stub: invokevirtual 182 invokevirtual>
  6.74%  11.34%  <stub: invoke return entry points>
  6.38%   6.33%  <stub: ireturn 172 ireturn>
  3.60%   2.10%  <stub: if_icmpge 162 if_icmpge>
  3.43%   5.79%  <stub: pop 87 pop>
  2.62%   1.81%  <stub: aload_1 43 aload_1>
  2.49%   2.40%  <stub: iload_3 29 iload_3>
  2.23%   2.24%  <kernel>
3.4. Discussion I
Looking at the case of a single coder, we can already spot a few things:
- If you know a call receiver exactly at development/compile time, it is a good idea to call it statically. This practice actually coincides with good engineering: if the method behavior does not depend on the receiver state, it might as well be static to begin with.
- C2 will try to guess the receiver based on the profile, but it has to do the defensive type checks and/or null checks, as mandated by the language specification. While this is more than enough in a vast majority of cases, the inherent costs of these mandatory checks are sometimes visible at very high magnification.
- C1 is sloppy when it comes to exploiting the static type information, as well as the profile. This may affect warmup times and time-to-performance. It is an open question whether you can implement the C2-like smart moves without compromising the compilation time.
To reiterate, the method invocation performance is not a concern most of the time, and the VM optimizations are actually closing the gap very tightly. We will see what happens when they can’t, as we go on.
4. Two Types
Now that we have learned some basic stuff about single-type calls, let's see how runtimes handle calls with different targets. The simplest way to achieve that in a benchmark would be to modify our single-coder benchmark to segregate two Coders.
@Param({"0.0", "0.1", "0.5", "0.9", "1.0"})
private double bias;
private Data[] datas;
@Setup
public void setup() {
Random r = new Random(12345);
List<Data> ts = new ArrayList<Data>();
for (int c = 0; c < count; c++) {
byte[] contents = new byte[10];
r.nextBytes(contents);
byte id = (byte) (c < (bias * count) ? 1 : 0);
ts.add(new Data(id, contents));
}
Collections.shuffle(ts, r);
datas = ts.toArray(new Data[0]);
}
This further complicates benchmarking, because now we have to decide what distribution of coders to measure. Should it be 50/50? Should it be 1/42? In hindsight, we are going to introduce the bias parameter, which sets the share of Coder1 instances in the distribution, and measure the benchmarks at 0.0, 0.1, 0.5, 0.9, and 1.0 ratios.
Measuring only a single |
4.1. Reference Tests
Again, let us "eat the elephant one bite at a time", and start from the tests that we are already familiar with, the Ref family.
static_Ref
is not applicable here anymore, because there is no way to select two coder implementations with static
methods and nothing else (we will see how auxiliary selectors perform later). That leaves dynamic_Interface_Ref
and
dynamic_Abstract_Ref
.
4.1.1. C2: dynamic_…_Ref
If we run these tests with C2, we will see:
Benchmark (bias) (count) Mode Cnt Score Error Units
Two.dynamic_Interface_Ref 0.0 10000 avgt 50 52.701 ± 0.470 us/op
Two.dynamic_Interface_Ref 0.1 10000 avgt 50 57.760 ± 0.542 us/op
Two.dynamic_Interface_Ref 0.5 10000 avgt 50 94.744 ± 0.035 us/op
Two.dynamic_Interface_Ref 0.9 10000 avgt 50 59.046 ± 1.595 us/op
Two.dynamic_Interface_Ref 1.0 10000 avgt 50 50.857 ± 0.252 us/op
Two.dynamic_Abstract_Ref 0.0 10000 avgt 50 54.419 ± 1.733 us/op
Two.dynamic_Abstract_Ref 0.1 10000 avgt 50 58.244 ± 0.652 us/op
Two.dynamic_Abstract_Ref 0.5 10000 avgt 50 95.765 ± 0.180 us/op
Two.dynamic_Abstract_Ref 0.9 10000 avgt 50 58.618 ± 0.843 us/op
Two.dynamic_Abstract_Ref 1.0 10000 avgt 50 52.412 ± 0.185 us/op
Notice a few things about these results already. First, the performance depends on the bias; we can already speculate why, but we will dive into the exact reasons a bit later. Second, the performance difference is symmetrical around 1/2 bias, which already gives you a few hints about what’s happening. Third, the abstract case and the interface case seem to perform the same.
All right, let us disassemble some stuff! Take dynamic_Interface_Ref
with bias = 0.0, that is, only Coder0
instances
are present in the code:
[Verified Entry Point]
0.03% 0.07% mov %eax,-0x14000(%rsp)
8.36% 7.19% push %rbp
0.03% 0.03% sub $0x20,%rsp
mov 0x10(%rsi),%r10d ; get field $data
8.83% 10.53% mov 0x14(%rsi),%r8d ; get field $coder
0.05% 0.03% mov 0x8(%r12,%r8,8),%r11d ; load $coder.<classword>
12.54% 16.32% cmp $0xf801040a,%r11d ; check $coder is Coder0
jne SLOW_PATH ; not? bail
6.87% 9.52% mov 0xc(%r12,%r10,8),%eax ; Coder0.work() body
2.72% 2.52% add $0x20,%rsp ; epilogue and return
0.02% pop %rbp
0.03% 0.02% test %eax,0x171b1d0e(%rip)
10.09% 11.50% retq
SLOW_PATH:
mov $0xffffffde,%esi ; handle the very rare case
mov %r8d,%ebp
mov %r10d,(%rsp)
callq 0x00007f4d310051a0 ; call into runtime <uncommon trap>
Notice anything? The code looks similar to what we had before, but this time we check the actual type of coder, and then call into the runtime if we see something else, not Coder0. The VM was actually smart enough to figure out that the only type here is Coder0.
It is sometimes enlightening to view the type profile per call site. It’s doable with

  net.shipilev.two.Data::do_Dynamic_Interface_Ref (14 bytes)
    @ 8 net.shipilev.two.Coder0::work (3 bytes) inline (hot)
      \-> TypeProfile (6391/6391 counts) = net/shipilev/two/Coder0 |
What’s an "uncommon trap"? That’s another part of VM-Code interface. When compiler knows some branch/case is unlikely,
it can emit a simple call back to VM instead of generating a whole lot of not-to-be-used code. This can greatly cut
the compile times, as well as the generated code footprint, and provide denser hot paths, e.g. in loops. In this case,
stepping on that uncommon trap may also mean our profile information about |
dynamic_Interface_Ref
with bias = 0.1 is more interesting:
[Verified Entry Point]
0.02% 0.12% mov %eax,-0x14000(%rsp)
7.98% 7.81% push %rbp
0.08% 0.07% sub $0x20,%rsp
0.64% 0.69% mov 0x10(%rsi),%r10d ; get field $data
7.62% 8.67% mov 0x14(%rsi),%r8d ; get field $coder
0.13% 0.03% mov 0x8(%r12,%r8,8),%r11d ; load $coder.<classword>
13.64% 19.45% cmp $0xf801040a,%r11d ; check $coder is Coder0
jne CODER_1 ; not? jump further
6.36% 8.34% mov 0xc(%r12,%r10,8),%eax ; Coder0.work() body
EPILOG:
1.21% 1.07% add $0x20,%rsp ; epilog and return
0.03% pop %rbp
0.51% 0.62% test %eax,0x1874c24e(%rip)
8.85% 11.10% retq
CODER_1:
1.18% 1.40% cmp $0xf8012585,%r11d ; check if $coder is Coder1
jne SLOW_PATH ; not? jump further
0.52% 0.58% mov 0xc(%r12,%r10,8),%eax ; Coder1.work() body
0.02% jmp EPILOG ; jump back to epilog
SLOW_PATH:
mov $0xffffffc6,%esi ; handle the very rare case
mov %r8d,%ebp
mov %r10d,(%rsp)
callq 0x00007f2b2d0051a0 ; call into runtime <uncommon trap>
Now we actually have the two real coder implementations. Notice, however, that the codepath leading to Coder1 is pushed out from the straight path, and the more frequent case of Coder0 is taking its place. If we were to run with bias = 0.9, the places of Coder0 and Coder1 would reverse here, because the profile says so, and the code layouter consults with it.
In other words, compilers use frequency-based basic block layout, that puts the basic blocks on a straight path based on the transition frequencies between the basic blocks. That has a handy effect for machines: the most frequent code is linear, which plays nicely with instruction caches and decoders. This also has a handy effect on humans: if you look at the compare-and-branch sequence in the generated code, then the unlikely branch will probably be fanned out, and the likely branch will follow through. |
Same test with bias = 0.5:
[Verified Entry Point]
1.80% 2.27% mov %eax,-0x14000(%rsp)
5.28% 5.51% push %rbp
0.47% 0.35% sub $0x20,%rsp
3.00% 3.95% mov 0x10(%rsi),%r10d ; get field $data
4.86% 6.28% mov 0x14(%rsi),%r8d ; get field $coder
0.10% 0.10% mov 0x8(%r12,%r8,8),%r11d ; load $coder.<classword>
13.92% 17.49% cmp $0xf8012585,%r11d ; check if $coder is Coder1
je CODER_1 ; jump further
4.96% 6.58% cmp $0xf801040a,%r11d ; check if $coder is Coder0
jne SLOW_PATH ; not? jump out
2.70% 2.93% mov 0xc(%r12,%r10,8),%eax ; Coder0.work() body
1.42% 0.93% jmp EPILOG ; jump to epilog
CODER_1:
3.78% 5.48% mov 0xc(%r12,%r10,8),%eax ; Coder1.work() body
EPILOG:
2.20% 1.77% add $0x20,%rsp ; epilog and return
2.11% 1.88% pop %rbp
2.21% 1.68% test %eax,0x15e9c93e(%rip)
4.41% 4.23% retq
SLOW_PATH:
mov $0xffffffc6,%esi
mov %r8d,%ebp
mov %r10d,(%rsp)
callq 0x00007f8737f6b1a0 ; call into runtime <uncommon trap>
Now, because of the runtime profile affecting the block layout, this is arguably a degenerate case, where a tiny swing towards either the Coder0 or the Coder1 case can affect the layout decision, potentially leading to a great run-to-run variance. From a performance standpoint, however, such a distribution means lots of mispredicted branches, which will affect the performance on modern CPUs.
For example, if we run under perf (JMH conveniently provides the support
with bias=0.0:
     197,299,178,798 cycles          # 3.977 GHz
     323,821,144,134 instructions    # 1.64 insns per cycle
      47,692,252,113 branches        # 961.278 M/sec
          16,565,723 branch-misses   # 0.03% of all branches
bias=0.1:
     197,078,370,315 cycles          # 3.976 GHz
     295,840,052,712 instructions    # 1.50 insns per cycle
      44,917,582,623 branches        # 906.143 M/sec
         836,907,280 branch-misses   # 1.86% of all branches
bias=0.5:
     197,541,017,422 cycles          # 3.977 GHz
     183,698,062,858 instructions    # 0.93 insns per cycle
      31,674,184,694 branches        # 637.693 M/sec
       2,619,451,036 branch-misses   # 8.27% of all branches |
One might wonder, why wouldn’t we sort the Third, if you have a large sorted array, then chances are you will trip the profiler half-way through processing the array, and therefore skew the type profile. Imagine the array of 10K elements, first half populated with the elements of one type, and the other half with the elements of another. If compiler trips after traversing the first 7.5K, then the "observed" type profile would be 66.(6)% of type 1, and 33.(3)% of type 2, therefore obliterating the experimental setup. |
dynamic_Abstract_Ref
behaves almost exactly the same, and generates almost the same code, which explains why the
performance is similar to dynamic_Interface_Ref
.
4.1.2. C1: dynamic_…_Ref
Benchmark (bias) (count) Mode Cnt Score Error Units
Two.dynamic_Interface_Ref 0.0 10000 avgt 50 59.592 ± 0.379 us/op
Two.dynamic_Interface_Ref 0.1 10000 avgt 50 88.052 ± 0.352 us/op
Two.dynamic_Interface_Ref 0.5 10000 avgt 50 136.241 ± 0.426 us/op
Two.dynamic_Interface_Ref 0.9 10000 avgt 50 89.264 ± 1.142 us/op
Two.dynamic_Interface_Ref 1.0 10000 avgt 50 59.285 ± 1.478 us/op
Two.dynamic_Abstract_Ref 0.0 10000 avgt 50 57.449 ± 2.096 us/op
Two.dynamic_Abstract_Ref 0.1 10000 avgt 50 74.185 ± 1.606 us/op
Two.dynamic_Abstract_Ref 0.5 10000 avgt 50 120.458 ± 0.397 us/op
Two.dynamic_Abstract_Ref 0.9 10000 avgt 50 75.714 ± 1.695 us/op
Two.dynamic_Abstract_Ref 1.0 10000 avgt 50 58.020 ± 1.606 us/op
C1 experiences the same symmetry around bias = 0.5, but now the Interface and Abstract cases are clearly different. If you disassemble both, you will see that in both cases the calls are not inlined, and look similar to the C1 version in the single-type case. However, there are also additional pieces of hot code, attributed to the VM stubs that do the actual vtable (virtual call table) and itable (interface call table) dispatches.
This is what the vtable stub looks like:
Decoding VtableStub vtbl[5]@12
1.68% 1.32% mov 0x8(%rsi),%eax ; unpack class word
3.50% 3.34% shl $0x3,%rax
1.69% 1.84% mov 0x1e0(%rax),%rbx ; select vtable
9.21% 8.17% jmpq *0x40(%rbx) ; select vtable[0x40], and jump
This is what the itable stub looks like:
Decoding VtableStub itbl[0]@12
1.86% 0.90% mov 0x8(%rsi),%r10d ; unpack classword
1.80% 1.73% shl $0x3,%r10
1.07% 0.71% mov 0x120(%r10),%r11d ; select interface table
7.27% 5.22% lea 0x1b8(%r10,%r11,8),%r11 ; ...and lookup the interface below
4.09% 4.13% lea (%r10),%r10
0.03% 0.06% mov (%r11),%rbx
7.98% 12.49% cmp %rbx,%rax
test %rbx,%rbx
je 0x00007f9c4d117cca
mov (%r11),%rbx
cmp %rbx,%rax
jne 0x00007f9c4d117caa
2.55% 2.60% mov 0x8(%r11),%r11d
0.98% 1.48% mov (%r10,%r11,1),%rbx
8.43% 11.33% jmpq *0x40(%rbx) ; select target method, and jump
Note that interface selection is generally larger, because dispatching the interface call takes much more work, and
this explains the performance difference between dynamic_Interface_Ref
and dynamic_Abstract_Ref
cases.
The interface calls are interesting in the way you have to deal with the dispatch. Virtual calls are easy in the sense that we can construct the vtables for the entire class hierarchy on the spot, and have just a single indirection off the vtable to resolve the call target. Interface calls, however, are more complicated: we only observe the interface klass, not the implementation class, and therefore we can't predict at which offset in the concrete class vtable to look for the interface call target. Therefore, we have to consult the interface table for the given object reference, figure out the vtable for that particular interface, and then call off the resolved vtable. The interface table lies in a generic location within each class, and therefore |
4.1.3. Interpreter
Benchmark (bias) (count) Mode Cnt Score Error Units
Two.dynamic_Interface_Ref 0.0 10000 avgt 50 477.109 ± 9.907 us/op
Two.dynamic_Interface_Ref 0.1 10000 avgt 50 469.529 ± 2.929 us/op
Two.dynamic_Interface_Ref 0.5 10000 avgt 50 482.693 ± 5.955 us/op
Two.dynamic_Interface_Ref 0.9 10000 avgt 50 473.310 ± 4.758 us/op
Two.dynamic_Interface_Ref 1.0 10000 avgt 50 480.534 ± 4.382 us/op
Two.dynamic_Abstract_Ref 0.0 10000 avgt 50 454.018 ± 1.001 us/op
Two.dynamic_Abstract_Ref 0.1 10000 avgt 50 459.251 ± 0.366 us/op
Two.dynamic_Abstract_Ref 0.5 10000 avgt 50 458.424 ± 2.020 us/op
Two.dynamic_Abstract_Ref 0.9 10000 avgt 50 464.937 ± 7.386 us/op
Two.dynamic_Abstract_Ref 1.0 10000 avgt 50 454.475 ± 1.075 us/op
The difference between the invokevirtual and invokeinterface dispatch also explains why the interpreter tests for dynamic_Abstract_Ref are faster than dynamic_Interface_Ref. We will not dive further into this. Interested readers are invited to profile these scenarios if they find them important.
4.1.4. Discussion II.1
We have seen that the cost of dispatching across two types varies greatly between the compilers and interpreters.
- The most common observable effect is the difference between invokevirtual and invokeinterface. While good engineering practice tells us to use interfaces where possible (especially functional interfaces for lambdas), the plain virtual calls can be much cheaper, since they don't involve the complicated call target lookups. This is not an immediate concern for most of the code that would be compiled with C2, but both C1 and the interpreter demonstrate a measurable difference between calling via invokevirtual and invokeinterface.
- C2 does an interesting profile-guided optimization based on the observed type profile. If there is only a single receiver type (that is, the call site is monomorphic), it can simply check for the predicted type, and inline the target directly. The same optimization can and will be applied if there are two receiver types observed (that is, the call site is bimorphic), at the cost of two branches. (A rough Java rendition of this is sketched after this list.)
- Even with aggressive bimorphic inlining, the hardware has problems with mispredicted branches, which will inevitably be present once more than one type actually shows up, and which get worse as the type mix approaches 50/50. However, this is still better than going through the vtable/itable stubs and experiencing the same mispredict there, as evidenced by comparing C2 vs. C1.
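A rough Java rendition of that bimorphic case (illustrative only; it mirrors the bias = 0.1 listing above, where the last branch is really an uncommon trap back into the runtime):

public int do_Dynamic_Interface_Ref() {
    Coder c = coder;
    if (c.getClass() == Coder0.class) {        // first classword check, the frequent type goes first
        return data.length;                    // inlined Coder0.work() body
    } else if (c.getClass() == Coder1.class) { // second classword check
        return data.length;                    // inlined Coder1.work() body
    }
    return c.work(data);                       // stand-in for the uncommon trap / generic dispatch
}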
4.2. Cheating The Runtime
All right, now that we know the problems with generic dispatch over virtual and interface calls, can we cheat? We already know that static calls are perfectly resolved by most compilers, and even the interpreter. The caveat is that we can’t dispatch multiple types over the static method.
Can we implement the auxiliary selectors though? That is, can we store some field in the Data
class itself,
and dispatch across static calls by ourselves?
Let’s implement these:
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Static_ID_switch() {
    switch (id) {
        case 0: return Coder0.staticWork(data);
        case 1: return Coder1.staticWork(data);
        default:
            throw new IllegalStateException();
    }
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Static_ID_ifElse() {
    if (id == 0) {
        return Coder0.staticWork(data);
    } else if (id == 1) {
        return Coder1.staticWork(data);
    } else {
        throw new IllegalStateException();
    }
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Static_Bool_ifElse() {
    if (isCoder0) {
        return Coder0.staticWork(data);
    } else {
        return Coder1.staticWork(data);
    }
}
…and compare with the best known dynamic way to dispatch, dynamic_Abstract_Ref
.
4.2.1. C2: static_…
Benchmark (bias) (count) Mode Cnt Score Error Units
# Reference
Two.dynamic_Abstract_Ref 0.0 10000 avgt 50 54.419 ± 1.733 us/op
Two.dynamic_Abstract_Ref 0.1 10000 avgt 50 58.244 ± 0.652 us/op
Two.dynamic_Abstract_Ref 0.5 10000 avgt 50 95.765 ± 0.180 us/op
Two.dynamic_Abstract_Ref 0.9 10000 avgt 50 58.618 ± 0.843 us/op
Two.dynamic_Abstract_Ref 1.0 10000 avgt 50 52.412 ± 0.185 us/op
# if (coder0) Coder0.staticWork(data) else Coder1.staticWork(data);
Two.static_Bool_ifElse 0.0 10000 avgt 50 50.206 ± 0.730 us/op
Two.static_Bool_ifElse 0.1 10000 avgt 50 55.036 ± 0.686 us/op
Two.static_Bool_ifElse 0.5 10000 avgt 50 88.445 ± 0.508 us/op
Two.static_Bool_ifElse 0.9 10000 avgt 50 54.730 ± 0.733 us/op
Two.static_Bool_ifElse 1.0 10000 avgt 50 48.966 ± 1.338 us/op
# if (id == 0) Coder0.staticWork(data) else if (id == 1) Coder1.staticWork(data) ...
Two.static_ID_ifElse 0.0 10000 avgt 50 48.506 ± 0.274 us/op
Two.static_ID_ifElse 0.1 10000 avgt 50 56.479 ± 1.092 us/op
Two.static_ID_ifElse 0.5 10000 avgt 50 88.368 ± 0.205 us/op
Two.static_ID_ifElse 0.9 10000 avgt 50 55.940 ± 0.935 us/op
Two.static_ID_ifElse 1.0 10000 avgt 50 48.090 ± 0.406 us/op
# switch (id) case 0: Coder0.staticWork(data); break; case 1: Coder1.staticWork(data); break;
Two.static_ID_switch 0.0 10000 avgt 50 50.641 ± 2.770 us/op
Two.static_ID_switch 0.1 10000 avgt 50 57.856 ± 1.156 us/op
Two.static_ID_switch 0.5 10000 avgt 50 89.138 ± 0.265 us/op
Two.static_ID_switch 0.9 10000 avgt 50 56.312 ± 1.045 us/op
Two.static_ID_switch 1.0 10000 avgt 50 48.931 ± 1.009 us/op
As we can see, the cheating somewhat helps, but why? We have seen that monomorphic and bimorphic inlining is actually quite good; how can we win against that? Let's disassemble stuff.
This is static_Bool_ifElse
and bias = 0.0:
[Verified Entry Point]
1.24% 0.73% mov %eax,-0x14000(%rsp)
3.01% 2.43% push %rbp
sub $0x20,%rsp
1.56% 0.76% movzbl 0xc(%rsi),%r11d ; get field $isCoder0
15.25% 20.43% test %r11d,%r11d
je SLOW_PATH ; jump out if ($isCoder0 == false)
1.12% 1.12% mov 0x10(%rsi),%r11d ; get field $data
5.42% 7.05% mov 0xc(%r12,%r11,8),%eax ; Coder0.work() body(), arraylength
37.71% 32.57% add $0x20,%rsp ; epilogue and return
1.51% 1.67% pop %rbp
0.10% test %eax,0x1814ecd6(%rip)
5.71% 4.06% retq
SLOW_PATH:
mov %rsi,%rbp ; handle the rare case
mov %r11d,(%rsp)
mov $0xffffff65,%esi
callq 0x00007f32910051a0 ; call into <uncommon trap>
Notice how similar it is to monomorphic inlining! We even have the uncommon trap on the cold branch. The only difference
against the dynamic_Abstract_Ref
case is that we don’t mess with comparing the classword, which saves some instructions.
This explains the minuscule improvement.
static_Bool_ifElse
and bias = 0.1 responds similarly to the changed bias:
[Verified Entry Point]
1.07% 0.45% mov %eax,-0x14000(%rsp)
3.04% 3.43% push %rbp
0.03% 0.07% sub $0x10,%rsp
1.36% 0.57% mov 0x10(%rsi),%r11d ; get field $data
23.73% 31.22% movzbl 0xc(%rsi),%r10d ; get field $isCoder0
4.25% 4.70% test %r10d,%r10d
je CODER_1 ; jump out if ($isCoder0 == false)
1.98% 2.53% mov 0xc(%r12,%r11,8),%eax ; Coder0.work() body, arraylength
EPILOG:
21.30% 17.72% add $0x10,%rsp ; epilogue and return
1.10% 1.25% pop %rbp
0.90% 0.47% test %eax,0x187a7416(%rip)
5.87% 3.99% retq
CODER_1:
2.14% 2.03% mov 0xc(%r12,%r11,8),%eax ; Coder1.work() body, arraylength
4.90% 4.51% jmp EPILOG ; jump back
static_Bool_ifElse
and bias = 0.5 is even tighter:
[Verified Entry Point]
0.14% 0.10% mov %eax,-0x14000(%rsp)
2.93% 2.70% push %rbp
0.02% sub $0x10,%rsp
0.17% 0.19% mov 0x10(%rsi),%r11d ; get field $data
24.07% 27.26% movzbl 0xc(%rsi),%r10d ; get field $isCoder0
3.93% 3.69% test %r10d,%r10d
jne CODER_0 ; jump out if ($isCoder0 == false)
6.05% 5.85% mov 0xc(%r12,%r11,8),%eax ; Coder1.work() body, arraylength
13.12% 16.53% jmp EPILOG ; jump to epilogue
CODER_0:
5.95% 6.13% mov 0xc(%r12,%r11,8),%eax ; Coder0.work() body, arraylength
EPILOG:
11.27% 15.97% add $0x10,%rsp ; epilogue and return
0.14% 0.21% pop %rbp
3.09% 1.99% test %eax,0x1815b50f(%rip)
3.10% 2.88% retq
…but this is the real cheat: instead of comparing the classword twice, we compare the flag once, and select either alternative. This explains a significant improvement against the C2 bimorphic inlining. But even in this case, the branch misprediction costs make the performance differ across biases: bias = 0.5, obviously, has the largest chance of mispredicted branches.
4.2.2. C1: static_…
Benchmark (bias) (count) Mode Cnt Score Error Units
# Reference
Two.dynamic_Abstract_Ref 0.0 10000 avgt 50 57.449 ± 2.096 us/op
Two.dynamic_Abstract_Ref 0.1 10000 avgt 50 74.185 ± 1.606 us/op
Two.dynamic_Abstract_Ref 0.5 10000 avgt 50 120.458 ± 0.397 us/op
Two.dynamic_Abstract_Ref 0.9 10000 avgt 50 75.714 ± 1.695 us/op
Two.dynamic_Abstract_Ref 1.0 10000 avgt 50 58.020 ± 1.606 us/op
# if (coder0) Coder0.staticWork(data) else Coder1.staticWork(data);
Two.static_Bool_ifElse 0.0 10000 avgt 50 41.409 ± 0.145 us/op
Two.static_Bool_ifElse 0.1 10000 avgt 50 52.084 ± 0.410 us/op
Two.static_Bool_ifElse 0.5 10000 avgt 50 82.190 ± 1.855 us/op
Two.static_Bool_ifElse 0.9 10000 avgt 50 52.306 ± 0.423 us/op
Two.static_Bool_ifElse 1.0 10000 avgt 50 45.208 ± 0.513 us/op
# if (id == 0) Coder0.staticWork(data) else if (id == 1) Coder1.staticWork(data)
Two.static_ID_ifElse 0.0 10000 avgt 50 40.777 ± 1.236 us/op
Two.static_ID_ifElse 0.1 10000 avgt 50 53.125 ± 2.200 us/op
Two.static_ID_ifElse 0.5 10000 avgt 50 83.833 ± 1.231 us/op
Two.static_ID_ifElse 0.9 10000 avgt 50 53.826 ± 0.490 us/op
Two.static_ID_ifElse 1.0 10000 avgt 50 47.484 ± 0.802 us/op
# switch (id) case 0: Coder0.staticWork(data); break; case 1: Coder1.staticWork(data); break;
Two.static_ID_switch 0.0 10000 avgt 50 45.338 ± 1.404 us/op
Two.static_ID_switch 0.1 10000 avgt 50 53.087 ± 0.193 us/op
Two.static_ID_switch 0.5 10000 avgt 50 83.418 ± 0.900 us/op
Two.static_ID_switch 0.9 10000 avgt 50 54.315 ± 1.435 us/op
Two.static_ID_switch 1.0 10000 avgt 50 46.568 ± 0.446 us/op
Remember how C1 is not able to properly inline either abstract or interface bimorphic calls. This is why our cheaty ways of dispatch are apparently winning. Let's confirm with some disassembly.
Two.do_Static_Bool_ifElse
, bias = 0.0
:
[Verified Entry Point]
3.20% 3.99% mov %eax,-0x14000(%rsp)
2.34% 2.78% push %rbp
2.23% 1.68% sub $0x30,%rsp
1.29% 2.36% movsbl 0xc(%rsi),%eax ; get field $coder0
4.58% 4.19% cmp $0x0,%eax
je CODER_0 ; jump out if ($isCoder0 == false)
2.59% 1.63% mov 0x10(%rsi),%eax ; get field $data
1.91% 1.92% shl $0x3,%rax ; unpack it
0.48% 0.91% mov 0xc(%rax),%eax ; Coder0.staticWork() body, arraylength
36.12% 30.38% add $0x30,%rsp ; epilogue and return
1.07% 0.45% pop %rbp
0.32% 0.86% test %eax,0x16b80b92(%rip)
5.06% 3.71% retq
CODER_0:
mov 0x10(%rsi),%eax
shl $0x3,%rax ; handle ($isCoder0 == false) case
mov 0xc(%rax),%eax ; Coder1.staticWork() body, arraylength
...
C1 is not emitting the uncommon traps, and generates both branches. The frequency-based basic block layout still
moved the most frequent code path higher. Note how both Coder0
and Coder1
calls are inlined.
do_Static_Bool_ifElse
and bias = 0.5
:
[Verified Entry Point]
1.89% 2.80% mov %eax,-0x14000(%rsp)
0.91% 0.65% push %rbp
1.84% 1.99% sub $0x30,%rsp
0.24% 0.30% movsbl 0xc(%rsi),%eax ; get field $coder0
1.97% 1.56% cmp $0x0,%eax
je CODER_1 ; jump out if ($isCoder0 == false)
3.54% 4.32% mov 0x10(%rsi),%eax ; get field $data
6.23% 7.63% shl $0x3,%rax ; unpack it
0.56% 1.26% mov 0xc(%rax),%eax ; Coder0.staticWork() body
9.48% 10.11% add $0x30,%rsp ; epilogue and return
0.09% 0.07% pop %rbp
0.07% 0.06% test %eax,0x16982ed2(%rip)
2.17% 1.82% retq
CODER_1:
3.99% 4.16% mov 0x10(%rsi),%eax ; get field $data
5.14% 7.07% shl $0x3,%rax ; unpack it
0.82% 1.32% mov 0xc(%rax),%eax ; Coder1.staticWork() body
13.50% 13.58% add $0x30,%rsp ; epilogue and return
0.13% pop %rbp
0.06% test %eax,0x16982ebc(%rip)
2.50% 1.60% retq
This code looks more aesthetically pleasing to my eyes: there is only a single test, and two complete branches. It would be even better if global code motion moved the getfield of $data before the branch, but again, the goal for C1 is to produce the code fast.
4.2.3. Interpreter
Benchmark (bias) (count) Mode Cnt Score Error Units
# Reference
Two.dynamic_Abstract_Ref 0.0 10000 avgt 50 454.018 ± 1.001 us/op
Two.dynamic_Abstract_Ref 0.1 10000 avgt 50 459.251 ± 0.366 us/op
Two.dynamic_Abstract_Ref 0.5 10000 avgt 50 458.424 ± 2.020 us/op
Two.dynamic_Abstract_Ref 0.9 10000 avgt 50 464.937 ± 7.386 us/op
Two.dynamic_Abstract_Ref 1.0 10000 avgt 50 454.475 ± 1.075 us/op
# if (coder0) Coder0.staticWork(data) else Coder1.staticWork(data);
Two.static_Bool_ifElse 0.0 10000 avgt 50 432.259 ± 1.603 us/op
Two.static_Bool_ifElse 0.1 10000 avgt 50 445.540 ± 1.603 us/op
Two.static_Bool_ifElse 0.5 10000 avgt 50 490.248 ± 1.594 us/op
Two.static_Bool_ifElse 0.9 10000 avgt 50 460.986 ± 6.560 us/op
Two.static_Bool_ifElse 1.0 10000 avgt 50 444.312 ± 1.988 us/op
# if (id == 0) Coder0.staticWork(data) else if (id == 1) Coder1.staticWork(data) ...
Two.static_ID_ifElse 0.0 10000 avgt 50 435.689 ± 4.659 us/op
Two.static_ID_ifElse 0.1 10000 avgt 50 449.781 ± 2.840 us/op
Two.static_ID_ifElse 0.5 10000 avgt 50 507.018 ± 3.015 us/op
Two.static_ID_ifElse 0.9 10000 avgt 50 469.138 ± 0.913 us/op
Two.static_ID_ifElse 1.0 10000 avgt 50 461.163 ± 0.813 us/op
# switch (id) case 0: Coder0.staticWork(data); break; case 1: Coder1.staticWork(data); break;
Two.static_ID_switch 0.0 10000 avgt 50 474.470 ± 1.393 us/op
Two.static_ID_switch 0.1 10000 avgt 50 476.816 ± 1.699 us/op
Two.static_ID_switch 0.5 10000 avgt 50 506.091 ± 1.520 us/op
Two.static_ID_switch 0.9 10000 avgt 50 472.006 ± 4.305 us/op
Two.static_ID_switch 1.0 10000 avgt 50 465.332 ± 2.689 us/op
Argh, but this trick does not work for the interpreter. The reason is actually quite simple: the additional accesses for coder static instances cost something.
4.2.4. Discussion II.2
As we predicted, cheating the VM by dispatching the work manually works:
- In the C2 case, we save a few bytes in the instruction stream by not comparing the classword, and generally produce denser code. It is cheating, since the compiler has to provide the capabilities for de-optimization, handling other types if any come in the future, etc., while our own code ignores all those concerns. That is, it is not the compiler's fault that it produces less optimal code; it just has to be more generic.
- In the C1 case, we enable the inlining of the target methods by using static targets. The dispatch code compiles into something similar to what the C2 version is doing. Therefore, the performance across C1 and C2 is more consistent here.
Once again, these optimizations should be employed only where the peak performance is needed. The code otherwise produced by C2 is rather nice.
4.3. Cheated By The Runtime
All right, but what about combining the auxiliary selector and dynamic cases? Surely we will set ourselves up for
better inlining decisions if we do dynamic calls off the static final
constants? This is where it starts to get
funny. Let’s take C2 again, and run one of the dynamic_…
tests with a trick:
private AbstractCoder abstract_Bool_IfElse() {
    return (isCoder0 ? coder0 : coder1);
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Dynamic_Abstract_Bool_ifElse() {
    return abstract_Bool_IfElse().abstractWork(data);
}
Benchmark (bias) (count) Mode Cnt Score Error Units
Two.dynamic_Abstract_Ref 0.0 10000 avgt 50 54.419 ± 1.733 us/op
Two.dynamic_Abstract_Ref 0.1 10000 avgt 50 58.244 ± 0.652 us/op
Two.dynamic_Abstract_Ref 0.5 10000 avgt 50 95.765 ± 0.180 us/op
Two.dynamic_Abstract_Ref 0.9 10000 avgt 50 58.618 ± 0.843 us/op
Two.dynamic_Abstract_Ref 1.0 10000 avgt 50 52.412 ± 0.185 us/op
Two.dynamic_Abstract_Bool_ifElse 0.0 10000 avgt 50 50.240 ± 3.273 us/op
Two.dynamic_Abstract_Bool_ifElse 0.1 10000 avgt 50 61.233 ± 0.257 us/op
Two.dynamic_Abstract_Bool_ifElse 0.5 10000 avgt 50 100.861 ± 0.792 us/op
Two.dynamic_Abstract_Bool_ifElse 0.9 10000 avgt 50 60.693 ± 0.119 us/op
Two.dynamic_Abstract_Bool_ifElse 1.0 10000 avgt 50 54.018 ± 1.370 us/op
With bias=0.0, the generated code looks very optimized.
[Verified Entry Point]
1.41% 0.97% mov %eax,-0x14000(%rsp)
3.47% 2.98% push %rbp
0.02% 0.02% sub $0x20,%rsp
1.31% 0.77% movzbl 0xc(%rsi),%r11d ; get field $isCoder0
17.88% 24.63% test %r11d,%r11d
je SLOW_PATH ; jump out if ($isCoder0 == false)
1.58% 1.72% mov 0x10(%rsi),%r11d ; get field $data
3.93% 4.99% mov 0xc(%r12,%r11,8),%eax ; Coder0.abstractWork() body, arraylength
32.66% 27.46% add $0x20,%rsp ; epilog and return
1.43% 1.60% pop %rbp
0.02% test %eax,0x17cdcad6(%rip)
6.41% 5.08% retq
SLOW_PATH:
mov %rsi,%rbp
mov %r11d,(%rsp)
mov $0xffffff65,%esi
callq 0x00007f92150051a0 ; <uncommon trap>
In fact, it is almost the same code that static_Bool_ifElse is generating! Dumb performance tests like this will delude you into believing this trick costs you nothing when a sufficiently smart compiler is around.
It becomes obvious you shouldn't rely on this when you try a non-trivial bias, like bias=0.5:
[Verified Entry Point]
0.37% 0.77% mov %eax,-0x14000(%rsp)
1.84% 2.44% push %rbp
0.13% 0.12% sub $0x20,%rsp
0.40% 0.77% mov 0x10(%rsi),%r11d ; get field $data
16.73% 16.54% movzbl 0xc(%rsi),%r10d ; get field $isCoder0
2.04% 2.25% mov $0x71943eae0,%r8 ; get static field Data.coder1
0.13% 0.25% mov $0x71943ead0,%r9 ; get static field Data.coder0
0.10% 0.22% test %r10d,%r10d ; select either field based on $isCoder
2.01% 1.79% cmovne %r9,%r8
2.24% 1.77% mov 0x8(%r8),%r10d ; load coder.<classword>
8.38% 9.45% cmp $0xf801040a,%r10d ; if coder is Coder0
je CODER_0 ; jump over
5.65% 9.22% cmp $0xf8012585,%r10d ; if coder is Coder1
jne SLOW_PATH ; not? jump to slow path
3.39% 3.74% mov 0xc(%r12,%r11,8),%eax ; Coder1.abstractWork() body, arraylength
10.31% 9.37% jmp EPILOG ; jump to epilog
CODER_0:
5.18% 8.45% mov 0xc(%r12,%r11,8),%eax ; Coder0.abstractWork() body, arraylength
EPILOG:
16.63% 15.14% add $0x20,%rsp ; epilog and return
0.17% 0.27% pop %rbp
1.92% 0.47% test %eax,0x188e7ce3(%rip)
2.64% 2.05% retq
SLOW_PATH:
mov $0xffffffc6,%esi
mov %r8,%rbp
mov %r11d,(%rsp)
nop
callq 0x00007f3dc10051a0 ; <uncommon trap>
See what happened? We do double work: first, we select the static field based on the flag, as our Java code does; then, we do the bimorphic inlining based on the actual type. This happens regardless of the fact that we could infer the actual type from the static final constant itself. The trick is not working, and it actually degrades performance compared to the dynamic_Abstract_Ref case.
4.4. Emulating The Inlining
An impatient reader may wonder: what happens if we emulate the type checks with explicit instanceof checks? We can easily add a few more tests that dispatch like that, for example:
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Static_Interface_Ref_ifElse() {
if (coder instanceof Coder0) {
return Coder0.staticWork(data);
} else if (coder instanceof Coder1) {
return Coder1.staticWork(data);
} else {
throw new IllegalStateException();
}
}
Of course, we don’t know beforehand whether the performance is affected by instanceof-ing against an abstract or an interface instance, and therefore we need to check both. Similarly, we don’t know if calling the dynamic method after the instanceof check will be nicely optimized, so we have to try calling the static method as well.
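For illustration, the dynamic counterpart (labelled dynamic_Abstract_Ref_ifElse in the tables below) presumably looks along these lines. This is a sketch rather than the actual benchmark source; the field name abstractCoder is taken from the comments in the result tables.
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Dynamic_Abstract_Ref_ifElse() {
    if (abstractCoder instanceof Coder0) {
        return abstractCoder.abstractWork(data); // virtual call, but the receiver type is narrowed by the check
    } else if (abstractCoder instanceof Coder1) {
        return abstractCoder.abstractWork(data);
    } else {
        throw new IllegalStateException();
    }
}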
4.4.1. C2
Benchmark (bias) (count) Mode Cnt Score Error Units
; abstractCoder.abstractWork(data)
Two.dynamic_Abstract_Ref 0.0 10000 avgt 50 54.419 ± 1.733 us/op
Two.dynamic_Abstract_Ref 0.1 10000 avgt 50 58.244 ± 0.652 us/op
Two.dynamic_Abstract_Ref 0.5 10000 avgt 50 95.765 ± 0.180 us/op
Two.dynamic_Abstract_Ref 0.9 10000 avgt 50 58.618 ± 0.843 us/op
Two.dynamic_Abstract_Ref 1.0 10000 avgt 50 52.412 ± 0.185 us/op
; if (abstractCoder instanceof Coder0) abstractCoder.abstractWork(data) ...
Two.dynamic_Abstract_Ref_ifElse 0.0 10000 avgt 50 54.275 ± 0.209 us/op
Two.dynamic_Abstract_Ref_ifElse 0.1 10000 avgt 50 62.353 ± 1.610 us/op
Two.dynamic_Abstract_Ref_ifElse 0.5 10000 avgt 50 95.331 ± 0.252 us/op
Two.dynamic_Abstract_Ref_ifElse 0.9 10000 avgt 50 62.308 ± 0.579 us/op
Two.dynamic_Abstract_Ref_ifElse 1.0 10000 avgt 50 55.500 ± 0.423 us/op
; if (coder instanceof Coder0) coder.work(data) ...
Two.dynamic_Interface_Ref_ifElse 0.0 10000 avgt 50 52.735 ± 0.703 us/op
Two.dynamic_Interface_Ref_ifElse 0.1 10000 avgt 50 62.417 ± 0.625 us/op
Two.dynamic_Interface_Ref_ifElse 0.5 10000 avgt 50 95.351 ± 0.492 us/op
Two.dynamic_Interface_Ref_ifElse 0.9 10000 avgt 50 63.060 ± 0.774 us/op
Two.dynamic_Interface_Ref_ifElse 1.0 10000 avgt 50 55.465 ± 0.811 us/op
; if (coder instanceof Coder0) Coder0.staticWork(data) ...
Two.static_Interface_Ref_ifElse 0.0 10000 avgt 50 50.531 ± 2.636 us/op
Two.static_Interface_Ref_ifElse 0.1 10000 avgt 50 62.725 ± 1.784 us/op
Two.static_Interface_Ref_ifElse 0.5 10000 avgt 50 95.554 ± 0.481 us/op
Two.static_Interface_Ref_ifElse 0.9 10000 avgt 50 61.832 ± 0.644 us/op
Two.static_Interface_Ref_ifElse 1.0 10000 avgt 50 53.918 ± 0.730 us/op
; if (abstractCoder instanceof Coder0) Coder0.staticWork(data) ...
Two.static_Abstract_Ref_ifElse 0.0 10000 avgt 50 55.039 ± 0.425 us/op
Two.static_Abstract_Ref_ifElse 0.1 10000 avgt 50 57.879 ± 0.892 us/op
Two.static_Abstract_Ref_ifElse 0.5 10000 avgt 50 95.014 ± 0.660 us/op
Two.static_Abstract_Ref_ifElse 0.9 10000 avgt 50 57.665 ± 0.795 us/op
Two.static_Abstract_Ref_ifElse 1.0 10000 avgt 50 49.874 ± 0.183 us/op
These results seem to imply there is no significant difference either way. It will be enlightening to disassemble a few cases to understand how they are compiled. Here is dynamic_Interface_Ref_ifElse with bias=0.1:
[Verified Entry Point]
0.01% mov %eax,-0x14000(%rsp)
4.22% 3.71% push %rbp
0.01% sub $0x20,%rsp
0.05% 0.21% mov 0x14(%rsi),%r11d ; get field $coder
15.77% 20.67% mov 0x8(%r12,%r11,8),%r10d ; load $coder.<classword>
7.16% 7.92% mov 0x10(%rsi),%r9d ; get field $data
0.17% 0.08% cmp $0xf8010422,%r10d ; check $coder is Coder0
jne CODER_1 ; jump further if not
1.27% 1.59% mov 0xc(%r12,%r9,8),%eax ; Coder0.work() body, arraylength
EPILOG:
32.13% 29.84% add $0x20,%rsp ; epilogue and return
pop %rbp
0.55% 0.24% test %eax,0x18a10f8e(%rip)
4.84% 3.66% retq
CODER_1:
1.73% 1.65% cmp $0xf8012585,%r10d ; check if $coder is Coder1
jne SLOW_PATH ; jump to slowpath if not
1.28% 1.64% mov 0xc(%r12,%r9,8),%eax ; Coder1.work() body, arraylength
3.59% 3.13% jmp EPILOG ; jump back
SLOW_PATH:
<omitted>
static_Abstract_Ref_ifElse
with bias=0.1:
[Verified Entry Point]
0.01% mov %eax,-0x14000(%rsp)
4.23% 3.68% push %rbp
0.01% sub $0x20,%rsp
0.04% 0.08% mov 0x18(%rsi),%r11d ; get field $coder
18.17% 23.09% mov 0x8(%r12,%r11,8),%r10d ; load $coder.<classword>
13.54% 16.20% mov 0x10(%rsi),%r9d ; get field $data
1.18% 1.34% cmp $0xf8010422,%r10d ; check $coder is Coder0
jne CODER_1 ; jump further if not
2.08% 2.68% mov 0xc(%r12,%r9,8),%eax ; Coder0.work() body, arraylength
EPILOG:
19.26% 15.89% add $0x20,%rsp ; epilogue and return
pop %rbp
0.66% 0.41% test %eax,0x18b55c8e(%rip)
4.60% 3.56% retq
CODER_1:
2.54% 2.16% cmp $0xf8012585,%r10d ; check if $coder is Coder1
jne SLOW_PATH ; jump to slowpath if not
1.67% 1.79% mov 0xc(%r12,%r9,8),%eax ; Coder1.work() body, arraylength
3.71% 2.54% jmp EPILOG ; jump back
SLOW_PATH:
<omitted>
Not only are these pieces of code exactly the same, they are also equivalent to letting C2 do the bimorphic inlining on its own! Let us remember this low-level trick, as we will reuse it later when we deal with several types.
This is possible since the instanceof check narrows the type of the reference on each branch, and the compiler can use that, together with the type profile, to devirtualize and inline the calls that follow. |
4.4.2. C1
Benchmark (bias) (count) Mode Cnt Score Error Units
; abstractCoder.abstractWork(data)
Two.dynamic_Abstract_Ref 0.0 10000 avgt 50 57.449 ± 2.096 us/op
Two.dynamic_Abstract_Ref 0.1 10000 avgt 50 74.185 ± 1.606 us/op
Two.dynamic_Abstract_Ref 0.5 10000 avgt 50 120.458 ± 0.397 us/op
Two.dynamic_Abstract_Ref 0.9 10000 avgt 50 75.714 ± 1.695 us/op
Two.dynamic_Abstract_Ref 1.0 10000 avgt 50 58.020 ± 1.606 us/op
; if (abstractCoder instanceof Coder0) abstractCoder.abstractWork(data) ...
Two.dynamic_Abstract_Ref_ifElse 0.0 10000 avgt 50 67.736 ± 0.628 us/op
Two.dynamic_Abstract_Ref_ifElse 0.1 10000 avgt 50 78.921 ± 0.484 us/op
Two.dynamic_Abstract_Ref_ifElse 0.5 10000 avgt 50 127.833 ± 0.418 us/op
Two.dynamic_Abstract_Ref_ifElse 0.9 10000 avgt 50 87.741 ± 0.829 us/op
Two.dynamic_Abstract_Ref_ifElse 1.0 10000 avgt 50 77.055 ± 1.126 us/op
; if (coder instanceof Coder0) coder.work(data) ...
Two.dynamic_Interface_Ref_ifElse 0.0 10000 avgt 50 67.132 ± 0.557 us/op
Two.dynamic_Interface_Ref_ifElse 0.1 10000 avgt 50 81.280 ± 1.099 us/op
Two.dynamic_Interface_Ref_ifElse 0.5 10000 avgt 50 127.587 ± 0.195 us/op
Two.dynamic_Interface_Ref_ifElse 0.9 10000 avgt 50 87.288 ± 0.809 us/op
Two.dynamic_Interface_Ref_ifElse 1.0 10000 avgt 50 75.220 ± 1.172 us/op
; if (coder instanceof Coder0) Coder0.staticWork(data) ...
Two.static_Interface_Ref_ifElse 0.0 10000 avgt 50 51.127 ± 0.897 us/op
Two.static_Interface_Ref_ifElse 0.1 10000 avgt 50 60.506 ± 2.378 us/op
Two.static_Interface_Ref_ifElse 0.5 10000 avgt 50 103.250 ± 0.207 us/op
Two.static_Interface_Ref_ifElse 0.9 10000 avgt 50 69.919 ± 3.465 us/op
Two.static_Interface_Ref_ifElse 1.0 10000 avgt 50 59.107 ± 2.813 us/op
; if (abstractCoder instanceof Coder0) Coder0.staticWork(data) ...
Two.static_Abstract_Ref_ifElse 0.0 10000 avgt 50 50.748 ± 0.797 us/op
Two.static_Abstract_Ref_ifElse 0.1 10000 avgt 50 61.218 ± 0.320 us/op
Two.static_Abstract_Ref_ifElse 0.5 10000 avgt 50 103.684 ± 0.446 us/op
Two.static_Abstract_Ref_ifElse 0.9 10000 avgt 50 69.081 ± 0.104 us/op
Two.static_Abstract_Ref_ifElse 1.0 10000 avgt 50 58.921 ± 0.335 us/op
C1 responds to this hack less gracefully. The dynamic cases are generally slower, because there is still a virtual/interface call involved that C1 is not able to inline. The static cases are generally faster, because they avoid the virtual/interface calls. Still, the static cases are not on par with the clean "by ID" selectors, since the generated code is much less optimized. Interested readers are invited to disassemble and study it.
4.5. Discussion II
Looking at the case of two coders, we can spot a few things:
-
The usual coding practice of separating distinct behaviors into distinct (sub)classes or interface implementations is nicely optimized by C2. C1 has some problems with efficiently inlining the bimorphic cases, but thankfully users will only observe this behavior transiently, until C2 kicks in.
-
In some cases, when you know the call targets exactly, it might be a good idea to dispatch statically across a few implementations, in order to avoid the residual overhead that compiler optimizations leave in the generated code. Again, doing so will probably yield a non-generic solution, and that’s one of the reasons it could be faster. The compilers have to produce generic code, and hence can only help so much.
-
The performance costs we observed, especially in an optimized case, are related to hardware branch (mis)predictions. Even the most optimized code still ought to have branches. It may look like there is no reason to inline the virtual calls then, but that’s a shortsighted view. The inlining actually broadens the scope of other optimizations, and that alone is, in many cases, enough reason to inline. With very thin nanobenchmarks, that advantage cannot be properly quantified.
5. Three Types And Beyond
With three coder types, the coder distribution makes for an even more challenging task. The easiest solution seems to be running with different "parts" per coder type.
@Param("1")
private int p1;
@Param("1")
private int p2;
@Param("1")
private int p3;
private Data[] datas;
@Setup
public void setup() {
Random r = new Random(12345);
int s1 = (count * p1) / (p1 + p2 + p3);        // boundary between the Coder0 and Coder1 slices
int s2 = (count * (p1 + p2)) / (p1 + p2 + p3); // boundary between the Coder1 and Coder2 slices
List<Data> ts = new ArrayList<Data>();
for (int c = 0; c < count; c++) {
byte[] contents = new byte[10];
r.nextBytes(contents);
byte id;
if (c < s1) {
id = 0;
} else if (c < s2) {
id = 1;
} else {
id = 2;
}
ts.add(new Data(id, contents));
}
Collections.shuffle(ts, r);
datas = ts.toArray(new Data[0]);
}
Also, we need to extend the selectors for three types. Here is the source code.
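For reference, the extended "by ID" selectors presumably look along these lines; this is a sketch under the assumption that they follow the same shape as the two-type selectors, and the linked source is authoritative:
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Static_ID_ifElse() {
    if (id == 0) {
        return Coder0.staticWork(data);
    } else if (id == 1) {
        return Coder1.staticWork(data);
    } else if (id == 2) {
        return Coder2.staticWork(data);
    } else {
        throw new IllegalStateException();
    }
}
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Static_ID_switch() {
    switch (id) {
        case 0:  return Coder0.staticWork(data);
        case 1:  return Coder1.staticWork(data);
        case 2:  return Coder2.staticWork(data);
        default: throw new IllegalStateException();
    }
}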
5.1. Monomorphic cases
Remember from the previous parts that compilers make the inlining decisions based on the profile. When there is only a single observed type per call site (i.e. the call site is "monomorphic"), or two observed types per call site (i.e. the call site is "bimorphic"), at least C2 is able to optimize the call. Before going further, it makes sense to verify this still holds true for our three-type benchmarks. This way we confirm that our knowledge from the previous part carries over to this test.
5.1.1. C2
Benchmark (count) (p1) (p2) (p3) Mode Cnt Score Error Units
Three.dynamic_Abstract_Ref 10000 1 0 0 avgt 50 52.384 ± 0.653 us/op
Three.dynamic_Interface_Ref 10000 1 0 0 avgt 50 51.807 ± 0.453 us/op
Three.static_ID_ifElse 10000 1 0 0 avgt 50 48.162 ± 0.182 us/op
Three.static_ID_switch 10000 1 0 0 avgt 50 48.597 ± 0.539 us/op
Three.dynamic_Abstract_Ref 10000 0 1 0 avgt 50 51.663 ± 0.159 us/op
Three.dynamic_Interface_Ref 10000 0 1 0 avgt 50 51.784 ± 0.824 us/op
Three.static_ID_ifElse 10000 0 1 0 avgt 50 48.308 ± 0.855 us/op
Three.static_ID_switch 10000 0 1 0 avgt 50 48.514 ± 0.148 us/op
Three.dynamic_Abstract_Ref 10000 0 0 1 avgt 50 52.356 ± 0.541 us/op
Three.dynamic_Interface_Ref 10000 0 0 1 avgt 50 52.874 ± 0.644 us/op
Three.static_ID_ifElse 10000 0 0 1 avgt 50 48.349 ± 0.200 us/op
Three.static_ID_switch 10000 0 0 1 avgt 50 48.651 ± 0.052 us/op
Everything adds up so far. We have already covered why the static cases are slightly faster than the dynamic cases. It is interesting to note that even though the …ifElse cases produce a static check layout, whose performance should theoretically be affected by which coder dominates, it is not affected in practice. For example, static_ID_ifElse with p1/p2/p3 = 0/0/1 looks like this:
[Verified Entry Point]
0.06% 0.06% mov %eax,-0x14000(%rsp)
4.44% 3.30% push %rbp
0.06% 0.10% sub $0x20,%rsp
0.02% 0.03% movsbl 0xc(%rsi),%r11d ; get field $id
12.10% 16.93% test %r11d,%r11d
je ID_0 ; jump if ($id == 0)
0.65% 0.66% cmp $0x1,%r11d
je ID_1 ; jump if ($id == 1)
0.75% 0.92% cmp $0x2,%r11d
jne FAIL ; jump if ($id != 2)
0.77% 0.85% mov 0x10(%rsi),%r11d ; get field $data
4.15% 2.73% mov 0xc(%r12,%r11,8),%eax ; Coder2.staticWork() body, arraylength
45.11% 44.93% add $0x20,%rsp ; epilogue and return
0.03% pop %rbp
0.03% 0.06% test %eax,0x170e758a(%rip)
6.48% 5.66% retq
FAIL:
<omitted>
ID_0:
<omitted>
ID_1:
<omitted>
Even though we have to go through 2 redundant branches, we still maintain almost the same performance. All branches seem to be nicely predicted.
5.1.2. C1
Benchmark (count) (p1) (p2) (p3) Mode Cnt Score Error Units
Three.dynamic_Abstract_Ref 10000 1 0 0 avgt 50 56.038 ± 2.058 us/op
Three.dynamic_Interface_Ref 10000 1 0 0 avgt 50 55.953 ± 1.388 us/op
Three.static_ID_ifElse 10000 1 0 0 avgt 50 45.050 ± 1.048 us/op
Three.static_ID_switch 10000 1 0 0 avgt 50 45.899 ± 0.199 us/op
Three.dynamic_Abstract_Ref 10000 0 1 0 avgt 50 56.541 ± 0.562 us/op
Three.dynamic_Interface_Ref 10000 0 1 0 avgt 50 55.355 ± 0.182 us/op
Three.static_ID_ifElse 10000 0 1 0 avgt 50 47.162 ± 0.469 us/op
Three.static_ID_switch 10000 0 1 0 avgt 50 46.968 ± 0.198 us/op
Three.dynamic_Abstract_Ref 10000 0 0 1 avgt 50 55.299 ± 0.118 us/op
Three.dynamic_Interface_Ref 10000 0 0 1 avgt 50 56.877 ± 0.469 us/op
Three.static_ID_ifElse 10000 0 0 1 avgt 50 46.654 ± 0.871 us/op
Three.static_ID_switch 10000 0 0 1 avgt 50 47.972 ± 0.198 us/op
The same picture is painted in the C1 case. We already know why the static cases are slightly faster than the dynamic cases (see the single- and two-type cases). Here, however, the position of the matching type in the check chain seems to matter a bit.
2.37% 3.29% mov %eax,-0x14000(%rsp)
3.12% 2.04% push %rbp
2.12% 2.18% sub $0x40,%rsp
1.46% 0.72% movsbl 0xc(%rsi),%eax ; get field $id
3.43% 2.96% cmp $0x0,%eax
je ID_0 ; jump if ($id == 0)
2.16% 2.17% cmp $0x1,%eax
je ID_1 ; jump if ($id == 1)
0.80% 0.89% cmp $0x2,%eax
jne FAIL ; jump if ($id != 2)
1.75% 1.12% mov 0x10(%rsi),%eax ; get field $data
3.60% 3.56% shl $0x3,%rax ; unpack
1.85% 2.10% mov 0xc(%rax),%eax ; Coder2.staticWork() body, arraylength
29.12% 24.91% add $0x40,%rsp ; epilogue and return
0.71% 1.14% pop %rbp
1.44% 1.22% test %eax,0x18217420(%rip)
5.43% 5.24% retq
FAIL:
<omitted>
ID_0:
<omitted>
ID_1:
<omitted>
The instruction selection is a bit different, as one would expect from different compilers, which may explain the performance difference between the compiled code versions, and why the C1 version reacts to how many branches the code takes.
My own speculation is like this: |
5.2. Bimorphic cases
5.2.1. C2
Benchmark (count) (p1) (p2) (p3) Mode Cnt Score Error Units
Three.dynamic_Abstract_Ref 10000 0 1 1 avgt 50 95.986 ± 0.835 us/op
Three.dynamic_Interface_Ref 10000 0 1 1 avgt 50 97.242 ± 2.292 us/op
Three.static_ID_ifElse 10000 0 1 1 avgt 50 89.957 ± 0.164 us/op
Three.static_ID_switch 10000 0 1 1 avgt 50 92.412 ± 2.351 us/op
Three.dynamic_Abstract_Ref 10000 1 0 1 avgt 50 97.081 ± 1.534 us/op
Three.dynamic_Interface_Ref 10000 1 0 1 avgt 50 97.001 ± 2.200 us/op
Three.static_ID_ifElse 10000 1 0 1 avgt 50 89.764 ± 1.230 us/op
Three.static_ID_switch 10000 1 0 1 avgt 50 92.158 ± 1.512 us/op
Three.dynamic_Abstract_Ref 10000 1 1 0 avgt 50 96.243 ± 1.528 us/op
Three.dynamic_Interface_Ref 10000 1 1 0 avgt 50 95.130 ± 0.234 us/op
Three.static_ID_ifElse 10000 1 1 0 avgt 50 88.195 ± 0.296 us/op
Three.static_ID_switch 10000 1 1 0 avgt 50 89.794 ± 0.124 us/op
The same thing applies to the bimorphic case. We already know that C2 is able to do the bimorphic inlining. We also know we can cheat and save a few cycles by doing the dispatch on our own.
5.2.2. C1
Benchmark (count) (p1) (p2) (p3) Mode Cnt Score Error Units
Three.dynamic_Abstract_Ref 10000 0 1 1 avgt 50 119.439 ± 0.371 us/op
Three.dynamic_Interface_Ref 10000 0 1 1 avgt 50 136.081 ± 0.605 us/op
Three.static_ID_ifElse 10000 0 1 1 avgt 50 84.079 ± 0.218 us/op
Three.static_ID_switch 10000 0 1 1 avgt 50 84.402 ± 0.860 us/op
Three.dynamic_Abstract_Ref 10000 1 0 1 avgt 50 120.241 ± 0.617 us/op
Three.dynamic_Interface_Ref 10000 1 0 1 avgt 50 136.434 ± 0.673 us/op
Three.static_ID_ifElse 10000 1 0 1 avgt 50 82.849 ± 0.295 us/op
Three.static_ID_switch 10000 1 0 1 avgt 50 83.061 ± 0.401 us/op
Three.dynamic_Abstract_Ref 10000 1 1 0 avgt 50 119.770 ± 0.216 us/op
Three.dynamic_Interface_Ref 10000 1 1 0 avgt 50 137.624 ± 3.552 us/op
Three.static_ID_ifElse 10000 1 1 0 avgt 50 85.120 ± 0.147 us/op
Three.static_ID_switch 10000 1 1 0 avgt 50 84.840 ± 0.218 us/op
Again, nothing new here. We already know that C1 is not able to do proper bimorphic inlining, and we are stuck with non-optimal code. Our custom dispatch enables the inlining, and avoids the VM dispatch.
5.3. Megamorphic cases
Now, with more than two types, we are stepping into something new. To make matters interesting, we measure three configurations: the types evenly distributed (1/1/1), the first type taking 90% of all instances (18/1/1), and the first type taking 95% of all instances (38/1/1).
5.3.1. C2
Benchmark (count) (p1) (p2) (p3) Mode Cnt Score Error Units
Three.dynamic_Abstract_Ref 10000 1 1 1 avgt 50 138.611 ± 2.524 us/op
Three.dynamic_Interface_Ref 10000 1 1 1 avgt 50 155.505 ± 3.027 us/op
Three.static_ID_ifElse 10000 1 1 1 avgt 50 99.868 ± 2.026 us/op
Three.static_ID_switch 10000 1 1 1 avgt 50 98.905 ± 0.600 us/op
Three.dynamic_Abstract_Ref 10000 18 1 1 avgt 50 79.674 ± 2.423 us/op
Three.dynamic_Interface_Ref 10000 18 1 1 avgt 50 90.016 ± 0.940 us/op
Three.static_ID_ifElse 10000 18 1 1 avgt 50 57.578 ± 1.079 us/op
Three.static_ID_switch 10000 18 1 1 avgt 50 55.561 ± 0.169 us/op
Three.dynamic_Abstract_Ref 10000 38 1 1 avgt 50 58.465 ± 0.160 us/op
Three.dynamic_Interface_Ref 10000 38 1 1 avgt 50 58.335 ± 0.209 us/op
Three.static_ID_ifElse 10000 38 1 1 avgt 50 51.692 ± 1.070 us/op
Three.static_ID_switch 10000 38 1 1 avgt 50 51.860 ± 0.156 us/op
What can we see here?
-
static cases get better when the runs are biased towards one type. This is due to better branch prediction. This difference provides an estimate of what one should expect from an optimized virtual call. (We additionally verified that exactly the same result is produced when we bias the distribution towards any other type, not only the first.)
-
When the types are evenly distributed, we get a severe performance hit in the dynamic_… cases. This is because HotSpot thinks the call site now has too many receiver types; in other words, the call site is megamorphic. Current C2 does not do megamorphic inlining at all.
-
…with the exception of one case, when there is a clear winner claiming >90% of the type profile.
Why did we try 38/1/1? This distribution allocates 95% of all targets to a single coder, and we know that the C2 implementation sets the threshold for this optimistic inlining via TypeProfileMajorReceiverPercent. |
Briefly following up on this, disassembling dynamic_Abstract_Ref at 18/1/1 yields:
[Verified Entry Point]
0.52% 0.53% mov %eax,-0x14000(%rsp)
2.50% 2.89% push %rbp
sub $0x10,%rsp
2.37% 1.20% mov 0x10(%rsi),%r10d ; get field $data
7.67% 9.87% mov 0x18(%rsi),%r8d ; get field $abstractCoder
1.70% 1.88% mov %r10,%rdx
0.37% 0.24% shl $0x3,%rdx
1.33% 0.60% mov %r8,%rsi
0.63% 1.29% shl $0x3,%rsi
0.51% 0.63% xchg %ax,%ax
0.43% 0.26% mov $0xffffffffffffffff,%rax
1.46% 0.48% callq 0x00007f4bd1046220 ; VIRTUAL CALL TO abstractWork()
2.96% 0.38% add $0x10,%rsp
0.79% 0.94% pop %rbp
1.44% 1.22% test %eax,0x16921d01(%rip)
0.86% 0.12% retq
…and all Coders not inlined in the profile:
....[Hottest Methods (after inlining)]..............................................................
 34.02%  37.66%  net.shipilev.three.Coder0::abstractWork
 25.20%  22.80%  net.shipilev.three.Data::do_Dynamic_Abstract_Ref
 14.79%  15.94%  net.shipilev.three.generated.Three_dynamic_Abstract_Ref::dynamic_Abstract_Ref_avgt_jmhStub
 13.60%  13.36%  java.util.HashMap::hash
  4.54%   3.35%  net.shipilev.three.Coder2::abstractWork
  4.05%   2.98%  net.shipilev.three.Coder1::abstractWork
This is similar to what we saw in the C1 case.
Disassembling dynamic_Abstract_Ref
at 38/1/1 shows that Coder0.abstractWork()
got inlined!
[Verified Entry Point]
mov %eax,-0x14000(%rsp)
3.98% 3.41% push %rbp
0.01% sub $0x10,%rsp
mov 0x10(%rsi),%r11d ; get field $data
10.59% 14.23% mov 0x18(%rsi),%r10d ; get field $abstractCoder
1.96% 2.51% mov 0x8(%r12,%r10,8),%r9d ; load $abstractCoder.<classword>
3.54% 4.89% cmp $0xf80103f6,%r9d ;
jne SLOW_PATH ; jump if $abstractCoder is not Coder0
0.98% 1.20% mov 0xc(%r12,%r11,8),%eax ; Coder0.abstractWork() body, arraylength
42.14% 41.75% add $0x10,%rsp ; epilogue and return
0.01% pop %rbp
0.01% test %eax,0x15ed3f4e(%rip)
4.77% 3.80% retq
SLOW_PATH:
0.56% 0.66% mov %r11,%rdx
0.22% 0.20% shl $0x3,%rdx
0.43% 0.35% lea (%r12,%r10,8),%rsi
xchg %ax,%ax
0.01% 0.01% mov $0xffffffffffffffff,%rax
0.21% 0.19% callq 0x00007f564c3c3220 ; VIRTUAL CALL TO abstractWork()
…and no distinct Coder0::abstractWork in the profile:
....[Hottest Methods (after inlining)]..............................................................
 70.49%  73.34%  net.shipilev.three.Data::do_Dynamic_Abstract_Ref
 21.47%  20.44%  net.shipilev.three.generated.Three_dynamic_Abstract_Ref::dynamic_Abstract_Ref_avgt_jmhStub
  2.12%   1.22%  net.shipilev.three.Coder1::abstractWork
  2.05%   1.25%  net.shipilev.three.Coder2::abstractWork
  2.04%   2.12%  java.util.concurrent.ConcurrentHashMap::tabAt
5.3.2. C1
Benchmark (count) (p1) (p2) (p3) Mode Cnt Score Error Units
Three.dynamic_Abstract_Ref 10000 1 1 1 avgt 50 138.515 ± 0.595 us/op
Three.dynamic_Interface_Ref 10000 1 1 1 avgt 50 153.387 ± 0.184 us/op
Three.static_ID_ifElse 10000 1 1 1 avgt 50 91.239 ± 0.170 us/op
Three.static_ID_switch 10000 1 1 1 avgt 50 91.631 ± 0.821 us/op
Three.dynamic_Abstract_Ref 10000 18 1 1 avgt 50 72.798 ± 0.205 us/op
Three.dynamic_Interface_Ref 10000 18 1 1 avgt 50 88.188 ± 0.180 us/op
Three.static_ID_ifElse 10000 18 1 1 avgt 50 54.326 ± 0.393 us/op
Three.static_ID_switch 10000 18 1 1 avgt 50 54.454 ± 0.081 us/op
Three.dynamic_Abstract_Ref 10000 38 1 1 avgt 50 67.391 ± 0.928 us/op
Three.dynamic_Interface_Ref 10000 38 1 1 avgt 50 81.523 ± 0.192 us/op
Three.static_ID_ifElse 10000 38 1 1 avgt 50 50.885 ± 0.064 us/op
Three.static_ID_switch 10000 38 1 1 avgt 50 50.374 ± 0.154 us/op
The static cases behave the same as in the C2 case. The dynamic cases are not inlined regardless of the profile information. We have already seen that C1 does not inline even the bimorphic calls, and naturally it does not inline the megamorphic ones either. The difference between the distributions is explained by branch prediction, either in the VM stubs or in our own static dispatchers.
5.4. Cheating The Runtime
We have already cheated with the static cases in this section, but remember the instanceof trick. We know that doing a manual instanceof check is akin to doing the VM's work for it. Can we play God here, and peel off the first coder under an instanceof check? In other words, do something like this:
public int do_Peel_Interface_Static() {
    if (coder instanceof Coder0) {
        return Coder0.staticWork(data); // peel off the first coder, dispatch it statically
    } else {
        return coder.work(data);        // everything else goes through the interface call
    }
}
public int do_Peel_Interface_Interface() {
    if (coder instanceof Coder0) {
        return coder.work(data);        // same interface call...
    } else {
        return coder.work(data);        // ...but now a separate call site with its own type profile
    }
}
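The Abstract-typed variants that also show up in the tables below (peel_Abstract_Static and peel_Abstract_Abstract) presumably mirror these; here is a sketch under that assumption, with the field name abstractCoder borrowed from the earlier tests:
public int do_Peel_Abstract_Static() {
    if (abstractCoder instanceof Coder0) {
        return Coder0.staticWork(data);          // peel off the first coder, dispatch it statically
    } else {
        return abstractCoder.abstractWork(data); // the rest goes through the virtual call
    }
}
public int do_Peel_Abstract_Abstract() {
    if (abstractCoder instanceof Coder0) {
        return abstractCoder.abstractWork(data); // monomorphic call site after the check
    } else {
        return abstractCoder.abstractWork(data); // separate call site, effectively bimorphic
    }
}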
5.4.1. C2
Benchmark (count) (p1) (p2) (p3) Mode Cnt Score Error Units
; equidistributed
Three.dynamic_Abstract_Ref 10000 1 1 1 avgt 50 138.611 ± 2.524 us/op
Three.dynamic_Interface_Ref 10000 1 1 1 avgt 50 155.505 ± 3.027 us/op
Three.static_ID_ifElse 10000 1 1 1 avgt 50 99.868 ± 2.026 us/op
Three.static_ID_switch 10000 1 1 1 avgt 50 98.905 ± 0.600 us/op
Three.peel_Abstract_Abstract 10000 1 1 1 avgt 50 105.511 ± 1.733 us/op
Three.peel_Abstract_Static 10000 1 1 1 avgt 50 105.814 ± 1.177 us/op
Three.peel_Interface_Interface 10000 1 1 1 avgt 50 105.841 ± 1.694 us/op
Three.peel_Interface_Static 10000 1 1 1 avgt 50 105.594 ± 1.151 us/op
; biased towards coder0 at 90%
Three.dynamic_Abstract_Ref 10000 18 1 1 avgt 50 79.674 ± 2.423 us/op
Three.dynamic_Interface_Ref 10000 18 1 1 avgt 50 90.016 ± 0.940 us/op
Three.static_ID_ifElse 10000 18 1 1 avgt 50 57.578 ± 1.079 us/op
Three.static_ID_switch 10000 18 1 1 avgt 50 55.561 ± 0.169 us/op
Three.peel_Abstract_Abstract 10000 18 1 1 avgt 50 63.588 ± 2.789 us/op
Three.peel_Abstract_Static 10000 18 1 1 avgt 50 63.164 ± 1.655 us/op
Three.peel_Interface_Interface 10000 18 1 1 avgt 50 64.309 ± 2.956 us/op
Three.peel_Interface_Static 10000 18 1 1 avgt 50 62.016 ± 0.185 us/op
; biased towards coder0 at 95%
Three.dynamic_Abstract_Ref 10000 38 1 1 avgt 50 58.465 ± 0.160 us/op
Three.dynamic_Interface_Ref 10000 38 1 1 avgt 50 58.335 ± 0.209 us/op
Three.static_ID_ifElse 10000 38 1 1 avgt 50 51.692 ± 1.070 us/op
Three.static_ID_switch 10000 38 1 1 avgt 50 51.860 ± 0.156 us/op
Three.peel_Abstract_Abstract 10000 38 1 1 avgt 50 57.915 ± 0.174 us/op
Three.peel_Abstract_Static 10000 38 1 1 avgt 50 58.093 ± 0.263 us/op
Three.peel_Interface_Interface 10000 38 1 1 avgt 50 58.642 ± 1.596 us/op
Three.peel_Interface_Static 10000 38 1 1 avgt 50 57.859 ± 0.198 us/op
What do we see here? Well, it turns out that in the truly megamorphic case of 1/1/1, peeling the first coder is clearly profitable. It is profitable with the skewed distribution of 18/1/1 as well. And it stops being profitable at the very skewed distribution of 38/1/1. Can you already guess why? You should have learned enough by now to provide a sound hypothesis. If you can, then my work is done here.
So, if we look into peel_Interface_Interface
at 1/1/1:
[Verified Entry Point]
0.23% 0.20% mov %eax,-0x14000(%rsp)
2.14% 2.45% push %rbp
0.11% 0.06% sub $0x20,%rsp
0.20% 0.54% mov 0x14(%rsi),%r11d ; get field $coder
19.68% 21.62% mov 0x8(%r12,%r11,8),%r10d ; load $coder.<classword>
7.74% 10.11% mov 0x10(%rsi),%ebp ; get field $data
0.17% 0.15% cmp $0xf8010426,%r10d ; check if $coder == Coder0
jne OTHER_CODERS ; jump further if not
5.95% 8.92% mov 0xc(%r12,%rbp,8),%eax ; Coder0.work() body, arraylength
EPILOG:
9.35% 8.89% add $0x20,%rsp ; epilogue and return
0.10% 0.05% pop %rbp
2.89% 1.11% test %eax,0x15f1f36f(%rip)
2.16% 1.32% retq
OTHER_CODERS:
2.63% 3.82% cmp $0xf80125c8,%r10d ; check if $coder == Coder2
je CODER_2 ; jump over if so
3.53% 4.68% cmp $0xf8012585,%r10d ; check if $coder == Coder1
jne SLOW_PATH ; jump to slow path if not
2.38% 2.17% mov 0xc(%r12,%rbp,8),%eax ; Coder1.work() body, arraylength
10.51% 10.05% jmp EPILOG ; jump back to epilogue
CODER_2:
2.92% 4.08% mov 0xc(%r12,%rbp,8),%eax ; Coder2.work() body, arraylength
5.34% 5.70% jmp EPILOG ; jump back to epilogue
SLOW_PATH:
mov $0xffffffc6,%esi ; slow path
mov %r11d,(%rsp)
callq 0x00007fb38cdea1a0
See the magic here? The call on the else branch inlined both remaining coders. Note this works even if we leave the virtual/interface call under the check, because now we have two call sites, each with its own type profile. The first call site is inlined as monomorphic, and its type check is merged with the original instanceof check. The second call site is now effectively bimorphic, and enjoys the bimorphic inlining.
5.4.2. C1
Benchmark (count) (p1) (p2) (p3) Mode Cnt Score Error Units
; equidistributed
Three.dynamic_Abstract_Ref 10000 1 1 1 avgt 50 138.515 ± 0.595 us/op
Three.dynamic_Interface_Ref 10000 1 1 1 avgt 50 153.387 ± 0.184 us/op
Three.static_ID_ifElse 10000 1 1 1 avgt 50 91.239 ± 0.170 us/op
Three.static_ID_switch 10000 1 1 1 avgt 50 91.631 ± 0.821 us/op
Three.peel_Abstract_Abstract 10000 1 1 1 avgt 50 146.658 ± 4.189 us/op
Three.peel_Abstract_Static 10000 1 1 1 avgt 50 136.430 ± 0.507 us/op
Three.peel_Interface_Interface 10000 1 1 1 avgt 50 156.816 ± 0.440 us/op
Three.peel_Interface_Static 10000 1 1 1 avgt 50 148.564 ± 0.541 us/op
; biased towards coder0 at 90%
Three.dynamic_Abstract_Ref 10000 18 1 1 avgt 50 72.798 ± 0.205 us/op
Three.dynamic_Interface_Ref 10000 18 1 1 avgt 50 88.188 ± 0.180 us/op
Three.static_ID_ifElse 10000 18 1 1 avgt 50 54.326 ± 0.393 us/op
Three.static_ID_switch 10000 18 1 1 avgt 50 54.454 ± 0.081 us/op
Three.peel_Abstract_Abstract 10000 18 1 1 avgt 50 83.206 ± 0.257 us/op
Three.peel_Abstract_Static 10000 18 1 1 avgt 50 65.073 ± 0.731 us/op
Three.peel_Interface_Interface 10000 18 1 1 avgt 50 87.273 ± 0.105 us/op
Three.peel_Interface_Static 10000 18 1 1 avgt 50 66.372 ± 0.121 us/op
; biased towards coder0 at 95%
Three.dynamic_Abstract_Ref 10000 38 1 1 avgt 50 67.391 ± 0.928 us/op
Three.dynamic_Interface_Ref 10000 38 1 1 avgt 50 81.523 ± 0.192 us/op
Three.static_ID_ifElse 10000 38 1 1 avgt 50 50.885 ± 0.064 us/op
Three.static_ID_switch 10000 38 1 1 avgt 50 50.374 ± 0.154 us/op
Three.peel_Abstract_Abstract 10000 38 1 1 avgt 50 76.414 ± 1.011 us/op
Three.peel_Abstract_Static 10000 38 1 1 avgt 50 58.389 ± 1.493 us/op
Three.peel_Interface_Interface 10000 38 1 1 avgt 50 81.647 ± 4.046 us/op
Three.peel_Interface_Static 10000 38 1 1 avgt 50 59.273 ± 0.848 us/op
Unfortunately, this trick only helps so much on C1. In the absence of profile-assisted inlining, there is no relief from reducing the type profile to two types. Again, interested readers are welcome to disassemble this case if they want.
5.5. Discussion III
Looking at the case of three coders, we can spot a few things:
-
Even if the static analysis shows there are three possible call targets, compilers can use the dynamic type profile to speculatively optimize for the exact receivers. C2 does this routinely in monomorphic and bimorphic cases, and sometimes in megamorphic cases, when there is a clear winner. C1 does not inline megamorphic calls.
-
The absence of megamorphic inlining hurts at least the nano-benchmarks. There seem to be three ways to avoid it: peel off the hottest type with manual checks, decrease TypeProfileMajorReceiverPercent to let the VM figure out and inline the most frequent target (see the sketch after this list), or do static dispatch over the known targets. Megamorphic inlining would need to be addressed one way or the other on the VM side, since it potentially affects lots of modern workloads.
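If you want to experiment with that threshold yourself, here is a minimal sketch (not part of the original benchmarks) that re-runs the 90%-biased case with a lowered threshold through the JMH runner API. The class name and the value 80 are illustrative assumptions; the flag itself is the HotSpot flag discussed above.
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class LowerThresholdRunner {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include("Three.dynamic_Abstract_Ref")                   // the megamorphic benchmark from above
                .param("p1", "18").param("p2", "1").param("p3", "1")     // the 90%-biased distribution
                .jvmArgsAppend("-XX:TypeProfileMajorReceiverPercent=80") // accept a 90% receiver for optimistic inlining
                .build();
        new Runner(opts).run();
    }
}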
Conclusion
-
You should not normally bother with method dispatch performance. The optimizing compiler that produces the final code for the hottest methods is able to inline most virtual calls. The remaining points are valid only if you have identified method dispatch/inlining as your actual performance problem.
-
You should actually care about the inlineability of the target method, e.g. its size, modifiers, etc. In this post, we have ignored this aspect, since we were using tiny trivial methods. However, if the target method cannot be successfully devirtualized, the inlining will not happen. The inlining actually broadens the scope of other optimizations, and that alone is, in many cases, enough reason to inline. With very thin nanobenchmarks, that advantage cannot be properly quantified.
-
Class Hierarchy Analysis is able to statically figure out there is only a single subclass of a given class, even in the absence of profile information. Take care when you are adding more subclasses to an otherwise lone super-class in the hierarchy: CHA can fail then, and invalidate your previous performance assumptions.
-
When CHA fails, C1 inlining for virtual and interface calls also fails, since the type profile is not available.
-
Even when CHA fails, monomorphic and bimorphic calls are routinely inlined by C2. Morphicity is derived from the runtime type profile collected by the interpreter or C1.
-
Megamorphic calls are very bad: neither C1 nor C2 can inline those. At this point, there seem to be three ways to avoid it: peel off the hottest type with manual checks, decrease TypeProfileMajorReceiverPercent to let the VM figure out and inline the most frequent target, or do static dispatch over the known targets. The VM might do better for these cases in the future, though.
-
If you have a simple method implementation that does not depend on instance state, it is a good idea to make it static, for both maintainability and performance reasons. Carrying around the class instance just to make a virtual call off it makes life harder for the runtime.
-
When you need peak performance for method dispatch, you may choose to manually dispatch over the static implementations. This goes against the usual development practice, but then quite a few low-level performance hacks take the same diversion route. This is what we are going to do in the String Compression work, especially given that String is already final, and adding a reference field to String is worse for footprint than adding a byte ID.
-
If you are choosing between interfaces and abstract classes, interfaces should not be your choice. Unoptimized interface calls are a burden. But, if you have to care about this difference, it means the profile-based de-virtualization and inlining did not happen, and you are probably already screwed.
-
Mispredicted branches are the killer. If you want to rigorously quantify the effects of low-level hacks like this, you have to consider trying different source data mixes to exploit different branch prediction behaviors, or choose the most unfavorable one pessimistically.