Aleksey Shipilёv, @shipilev, aleksey@shipilev.net

This post is very long, but please try to read it without skipping. The text naturally builds up on the observations done earlier in the same text. That’s also why the first part is rather boring and slow. Buckle up, and read. If you are running out of steam, take a break, and pick up where you left off. Please read it in its entirety before asking questions!
There is a much shorter "Too fast, too megamorphic" post from Richard Warburton (@RichardWarburto), if you need spoilers and a warmup. Vyacheslav Egorov (@mraleph) has a high-level post about similar stuff in "What’s up with monomorphism?".
This post is also available in ePUB and mobi.

Preface

Programming languages like Java provide the facilities for subtyping/polymorphism as one of the ways to construct modular and reusable software. This language choice naturally comes at a price, since there is no hardware support for virtual calls, and therefore runtimes have to emulate this behavior. In many, many cases the performance of method dispatch is not important. Actually, in a vast majority of cases, the low-level performance concerns are not the real concerns.

However, there are cases when method dispatch performance is important, and there you need to understand how dispatch works, what runtimes optimize for you, and what you can do to cheat and/or emulate similar behavior in your code. For example, in the course of String Compression work, we were faced with the problem of selecting the coder for a given String. The obvious and highly maintainable approach of creating a Coder interface, a few implementations, and dispatching the virtual calls over it, ran into performance problems on very tiny benchmarks. Therefore, we needed to contemplate something better. After a few experiments, this post was born as a reference for others who might try to do the same. This post also tangentially touches the inlining of virtual calls, as a natural part of the optimization.

As a good tradition, we will take some diversions into benchmarking methodology and general low-level performance engineering, so that even though the post itself is targeted at platform people, the general crowd can still learn a few tricks. As usual, if you still haven’t learned about JMH and/or haven’t looked through the JMH samples, then I suggest you do that first before reading the rest of this post for the best experience.

This post also assumes a good understanding of Java, Java bytecode, compilers, runtimes, and x86 assembly. Many readers complain that my posts lack completeness, in the sense that I omit the details of how to arrive at a particular conclusion. This time I decided to use lots of notes to highlight how one can decipher the compiler decisions, how to read the generated assembly, etc. It should be fun and educational, but you may also skip the parts formatted as callouts, if you don’t have time.

Thanks to Ilya Ryzhenkov (@orangy), Gleb Smirnov (@gvsmirnov), Julien Ponge (@jponge), Aurelien Gonnay (@yagonax), Nitsan Wakart (@nitsanw), and Aggelos Biboudis (@biboudis) for reviews and helpful suggestions!

1. Problem

We can formalize the problem as follows. Suppose we have a class Data:

public class Data {
  byte[] data;
}

…​and we want to do different things based on that data. Suppose we have N versions of code, each of those versions provide some sort of meaning for the data. In String Compression, for example, data can be either 1-byte-encoded array, or 2-byte-encoded array. Something somewhere should provide a meaning for that data; in other words, demangle it and/or do useful work. Suppose, for example, we have a "coder" abstraction that does something along the lines of:

public class Coder1 {
  int work(byte[] data) { return omNomNom(data); }
}

public class Coder2 {
  int work(byte[] data) { return cIsForCookie(data); }
}

The question is, what is the best way to implement different coders and dispatch over them?
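To make the question concrete, here is a minimal, hypothetical sketch of the two dispatch flavors we are about to compare; the trivial data.length bodies stand in for the omNomNom/cIsForCookie work, and all the names here are illustrative, not the benchmark code:

```java
// Hypothetical sketch: the same work dispatched two ways. The trivial
// data.length bodies stand in for the real omNomNom/cIsForCookie work.
interface Coder {
    int work(byte[] data);
}

class Coder1 implements Coder {
    public int work(byte[] data) { return data.length; }
}

class Coder2 implements Coder {
    public int work(byte[] data) { return data.length * 2; }
}

public class DispatchSketch {
    // Option A: dispatch over a stored ID; call targets are statically known
    static int workById(byte id, byte[] data) {
        switch (id) {
            case 1:  return data.length;
            case 2:  return data.length * 2;
            default: throw new IllegalStateException();
        }
    }

    // Option B: dispatch over a stored reference; a virtual call
    static int workByRef(Coder coder, byte[] data) {
        return coder.work(data);
    }

    public static void main(String[] args) {
        byte[] data = new byte[10];
        System.out.println(workById((byte) 2, data));      // 20
        System.out.println(workByRef(new Coder2(), data)); // 20
    }
}
```

Both shapes compute the same result; the rest of the post is about what the runtime can and cannot do with each of them.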

2. Experimental Setup

2.1. Hardware

While the exact code generation details surely differ among the platforms, in hindsight, most of the behaviors we are about to show are platform-agnostic, and done in high-level optimizers. Because of that, we simplify our lives and run the tests on a single 1x4x2 i7-4790K 4.0 GHz machine, Linux x86_64, JDK 8u40 EA. Suspicious readers are welcome to reproduce the results on their respective platforms.

2.2. Cases To Try

Granted, you may just inline the coder implementations straight into Data, but that’s a questionable practice, especially if you want to have multiple coder implementations. Therefore, we consider these ways to implement coders:

  1. Static: Make static implementations of Coders, all with static methods.

  2. Dynamic_Interface: Make Coder a proper interface, and provide the implementations.

  3. Dynamic_Abstract: Same as above, but make Coder the abstract superclass, and provide the implementations.

All these three methods of implementing the behavior need some way to encode it in the Data itself. We can come up with four schemes of such an encoding:

  1. ID_Switch: Store a byte ID, and select the coder by switching over it

  2. ID_IfElse: Store a byte ID, and select the coder by if-else-ing over it

  3. Bool_IfElse: Store a boolean, and select the coder by if-elsing over it. (Works only for N=2)

  4. Ref: Store a Coder reference, and do a virtual call.

There are also other ways to encode, e.g. storing the java.lang.reflect.Method or java.lang.invoke.MethodHandle, but we don’t care about them at this point, because both are somewhat similar to Ref, and also require some advanced compiler magic to work in a performant way. We will cover these in the future.
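For completeness, a hedged sketch of what a MethodHandle-based encoding could look like; Coder/Coder1 are illustrative stand-ins, and this is not how the benchmarks implement it:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical sketch of the MethodHandle-style encoding: store/reuse a
// MethodHandle instead of a Coder reference, and invoke through it.
interface Coder {
    int work(byte[] data);
}

class Coder1 implements Coder {
    public int work(byte[] data) { return data.length; }
}

public class MethodHandleDispatch {
    static final MethodHandle WORK;
    static {
        try {
            WORK = MethodHandles.lookup().findVirtual(
                    Coder.class, "work",
                    MethodType.methodType(int.class, byte[].class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int dispatch(Coder coder, byte[] data) throws Throwable {
        // invokeExact requires the exact static signature: (Coder, byte[])int
        return (int) WORK.invokeExact(coder, data);
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(dispatch(new Coder1(), new byte[10])); // 10
    }
}
```

As the text says, making this perform well requires the compiler to see through the handle, which is a story for another post.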

The choice of the encoding scheme also has to take the footprint into consideration. An additional field in the Data object can increase the instance size. However, since Java objects are usually aligned at 8 bytes, there is a sizeable "alignment shadow" in the object, where one can cram the field without increasing the instance size. Hiding a boolean/byte field there could be easier than putting an entire reference field. JOL can be used to look into field/object layout.
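As a back-of-the-envelope illustration, assuming a 12-byte object header (64-bit HotSpot with compressed oops) and 8-byte object alignment — JOL reports the real numbers, this only shows the arithmetic:

```java
// Hypothetical arithmetic sketch of the "alignment shadow", assuming a
// 12-byte object header and 8-byte object alignment.
public class AlignmentShadow {
    static long align(long size, long alignment) {
        return (size + alignment - 1) & ~(alignment - 1);
    }

    public static void main(String[] args) {
        long header = 12; // assumed: 64-bit HotSpot header with compressed oops
        long ref = 4;     // assumed: compressed reference size

        // Data { Coder coder; byte[] data; }: 12 + 4 + 4 = 20, padded to 24
        long twoRefs = align(header + 2 * ref, 8);

        // Adding a byte id: 12 + 4 + 4 + 1 = 21, still padded to 24 --
        // the id hides in the alignment shadow for free
        long twoRefsAndByte = align(header + 2 * ref + 1, 8);

        System.out.println(twoRefs);        // 24
        System.out.println(twoRefsAndByte); // 24
    }
}
```

In this layout the byte field costs nothing, while widening the shadow enough for another reference field would not be free.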

Combining these behaviors and selection schemes, we have quite a few variants already. Thankfully, not all of them are required for a concrete N. However, it would be interesting to dissect each of the relevant ones to understand how runtimes deal with code like this.

2.3. Benchmarks

The source code for the benchmarks is available here. Since we are dealing with a very narrow VM effect, let me explain a few things about the benchmark. This is what a single benchmark case looks like:

public class Test {
    @Param("100")
    private int count;

    private Data[] datas;

    @Setup
    public void setup() {
        datas = new Data[count];

        Random r = new Random();
        for (int c = 0; c < count; c++) {
            byte[] contents = new byte[10];
            r.nextBytes(contents);
            datas[c] = new Data(r.nextInt(2), contents);
        }
    }

    @Benchmark
    public void dynamic_Interface_Ref() {
        Data[] l = datas;
        int c = count;
        for (int i = 0; i < c; i++) {
            l[i].do_Dynamic_Interface_Ref();
        }
    }
}

public class Coder0 implements Coder {
    public int work(byte[] data) {
        return data.length; // something light-weight
    }
}

public class Coder1 implements Coder {
    public int work(byte[] data) {
        return data.length; // something light-weight
    }
}

public class Data {
    private static final Coder0 coder0 = new Coder0();
    private static final Coder1 coder1 = new Coder1();
    private final Coder coder;
    private final byte id;
    private final byte[] data;

    public Data(byte id, byte[] data) {
        this.id = id;
        this.data = data;
        this.coder = interface_ID_Switch();
    }

    private Coder interface_ID_Switch() {
        switch (id) {
            case 0: return coder0;
            case 1: return coder1;
            default:
                throw new IllegalStateException();
        }
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public int do_Dynamic_Interface_Ref() {
        return coder.work(data);
    }
}

A few tricks worth mentioning here:

  1. We do a loop over a few Data classes. That will be important when we start to look into dealing with different coders: runtime should be "poisoned" with the different types of coders at the call site. In the example above, the test has two types of coders, with ID=0 and ID=1.

  2. We hand-optimize the loop in @Benchmark method, because its performance is important when we deal with the interpreter.

  3. The benchmark loop does not consume the payload result. This is done to make the nanobenchmark amplify the costs of the payload itself, not the infrastructure costs. This practice is DANGEROUS and goes against the JMH recommendations. This trick should only be done by trained professionals, who would then verify it does not obliterate the benchmark, due to e.g. dead code elimination.

  4. The payload method (do_Dynamic_Interface_Ref in this example) is annotated with a special JMH hint that prevents inlining of payload into the benchmark loop. This is an important prerequisite for avoiding dead code elimination, at least on HotSpot today. It also helps to separate the payload body from the remaining code in the assembly dump.
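For contrast, a hedged sketch of the safer, JMH-recommended shape: accumulate the payload results and return them, so that nothing is trivially dead. In real JMH code the @Benchmark method would return this value or feed each result to Blackhole.consume(); the Payload/measure names here are illustrative:

```java
// Hypothetical sketch: consuming every payload result through an accumulator
// keeps the whole loop live against dead code elimination.
public class ConsumingLoop {
    interface Payload { int work(); }

    static int measure(Payload[] payloads) {
        int acc = 0;
        for (Payload p : payloads) {
            acc += p.work(); // every result feeds the accumulator
        }
        return acc;          // returning the sum makes the loop non-dead
    }

    public static void main(String[] args) {
        Payload[] ps = { () -> 1, () -> 2, () -> 3 };
        System.out.println(measure(ps)); // 6
    }
}
```

The benchmarks in this post deliberately skip this consumption to sharpen the measured effect, after verifying the payloads are not eliminated.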

2.4. VM Modes

By default, the HotSpot VM these days provides tiered compilation: first, the code is interpreted, then it is compiled with a baseline compiler (also known as the "client" compiler, or C1), and finally it is compiled with an aggressively optimizing compiler (also known as the "server" compiler, or C2).

The actual pipeline is more complicated. VM constants for tiers of compilation shine some light on what each level does. level=1 is pure C1, level=4 is pure C2. Levels in-between are C1 with profiling. The transitions between levels are governed by an advanced policy, and you can talk about them forever.

While most people care about the performance in C2, since the code that gets aggressively optimized is by construction the same code that consumes lots of time, we also care about the performance in pathological modes where the code hasn’t yet made it to C2 compilation.

3. Single Type

If you are inexperienced with walls of assembly, it is a good idea to pay attention to this part.

Let us start with a scenario with only a single coder. While this scenario is superficial, because you might as well inline the coder implementation into the data class itself, it still serves as an interesting base case that we want to untangle before diving into more complicated cases. Here, we only concern ourselves with the Ref style of encoding, since the other styles have almost no meaning here.

The JMH benchmarks consider a single Coder0 that extends AbstractCoder, and it is the only subclass of this abstract class. Coder0 also implements the Coder interface, and it is the only interface implementer. Even though we have a single Coder0 that both extends the abstract class and implements the interface, we try to invoke its work either via the abstract class (invokevirtual), the interface (invokeinterface), or the static method (invokestatic).

public abstract class AbstractCoder {
    public abstract int abstractWork(byte[] data);
}

public interface Coder {
    int work(byte[] data);
}

public class Coder0 extends AbstractCoder implements Coder {
    public static int staticWork(byte[] data) {
        return data.length;
    }

    @Override
    public int work(byte[] data) {
        return data.length;
    }

    @Override
    public int abstractWork(byte[] data) {
        return data.length;
    }
}

3.1. C2

Without further ado, the results are:

Benchmark                  (count)  Mode  Cnt   Score   Error  Units
One.dynamic_Abstract_Ref     10000  avgt   50  30.168 ± 0.602  us/op
One.dynamic_Interface_Ref    10000  avgt   50  31.391 ± 0.067  us/op
One.static_Ref               10000  avgt   50  29.171 ± 0.385  us/op

We can immediately spot the difference: static case is faster than dynamic, and interface case is slower than abstract class. While the difference seems minuscule, in some cases, like ours, it might matter a great deal. Let’s figure out why the performance is different. To do that, we employ JMH’s -prof perfasm profiler, which maps the Linux perf event sampling onto disassembly.

There are other ways to get the low-level profiling and disassembly, including but not limited to dealing with VM yourself, using Solaris Studio Performance Analyzer, or JITWatch and others. Every approach has its advantages and its limits, learn how to employ the right tool for a particular case. JMH perfasm is ideal for analyzing the JMH nanobenchmarks we will deal with in this post.

Editorial note. The assembly listings you will see in this post are slightly different from what you will see in the actual assembly dump from the VM. It would be much larger, but thanks to perfasm you will have to only care about the hottest blocks. It will also have lots of other info, including the machine addresses, Java bytecode instructions, Java stack traces, and other interesting VM comments. We would like to keep you oblivious of those comments at this point, in order to focus on things that do matter now. You can always repeat the tests in your environment and get the full disassembly.

3.1.1. C2: static_Ref

                  [Verified Entry Point]
  0.02%    0.02%    mov    %eax,-0x14000(%rsp)
  9.18%    6.63%    push   %rbp
           0.01%    sub    $0x10,%rsp
                    mov    0x10(%rsi),%r11d       ; get field $data
  9.54%    7.90%    mov    0xc(%r12,%r11,8),%eax  ; Coder0.staticWork() body
 10.79%   10.78%    add    $0x10,%rsp             ; epilogue and return
                    pop    %rbp
                    test   %eax,0x15f7c4e0(%rip)
 15.56%   21.52%    retq

This method yields simple and straightforward code, and that is not surprising: we just call the static method. Static methods are statically bound, which means we don’t need to resolve which method to call at runtime. We know exactly which method we are about to call at compile time. The same thing applies to our own private methods: they are also statically resolvable.

How did we figure out the thing about the Coder0.staticWork() body? Well, the original assembly dump will say that mov 0xc(%r12,%r11,8),%eax is "arraylength" in disguise, but what does it actually mean? What you are looking at is the "decoding" of the compressed reference, merged with the actual read. That instruction is the equivalent of loading %r12 + %r11*8 + 0xc, where %r12 is a base address, %r11 is the compressed address of data, 8 is the compressed references multiplier, and 0xc is the offset of the length field within the byte[] array. The presence of these rich addressing modes in current CPUs is one of the reasons why compressed references implementations provide a bearable generated code overhead. When in doubt, you can always dump the assembly with compressed references disabled (-XX:-UseCompressedOops).

Another interesting question is about null checks. The Java spec requires checking the byte[] instance for null before reading its length. We know the length is stored within the byte[] instance anyway, so we need a non-null instance in the generated code. Can you spot the null check in the assembly above? You can’t, because there is no explicit null check there. So, how does the runtime meet its specification obligations, or at least not crash? The hint is in the full assembly dump, which will say "implicit exception, dispatches to $addr" against the mov instruction. Dereferencing the null pointer will cause the CPU to raise a SEGV signal for the process. Luckily, the VM has a SEGV handler armed that can handle this, throw an appropriate NullPointerException, transfer the control back, etc. Since the SEGV event has a return address, the VM knows at which point in the generated code the null check had fired. This optimization greatly improves the performance of the ubiquitous null checks guaranteed by Java.
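The spec-level behavior this SEGV machinery must preserve can be seen from plain Java (ImplicitNullCheck is a hypothetical class for illustration):

```java
// Hypothetical sketch: reading the length of a null array must throw
// NullPointerException, whether the generated code uses an explicit
// compare or an implicit SEGV-based trap.
public class ImplicitNullCheck {
    static int length(byte[] data) {
        return data.length; // compiles to a single arraylength bytecode
    }

    public static void main(String[] args) {
        System.out.println(length(new byte[10])); // 10
        try {
            length(null);
        } catch (NullPointerException e) {
            System.out.println("NPE, as the spec requires");
        }
    }
}
```

The compiled code above delivers exactly these semantics, just without paying for an explicit compare-and-branch on the hot path.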

The only remaining part, other than %rbp (frame pointer) handling, is the test %eax,0x15f7c4e0(%rip). This seems cryptic, since the status flags this instruction sets are not used. Why is this instruction even there? If you look into the raw assembly dump, you will see a "# poll" comment against that line, which should give you a hint. In short, this is a safepoint poll, a part of the VM–code interface. The VM maintains a special memory page with READ permissions, and when it needs the generated code to report back as soon as possible, it drops the permissions to NONE, causing a trap when the test instruction tries to access the page, thus transferring control to the trap handler, and then to the VM. You can read more about safepoint mechanics elsewhere, e.g. in Nitsan’s "Where is my safepoint?".

3.1.2. C2: dynamic_Abstract_Ref

                  [Verified Entry Point]
  0.44%    0.65%    mov    %eax,-0x14000(%rsp)
  7.93%    6.68%    push   %rbp
  0.23%    0.24%    sub    $0x10,%rsp
  1.46%    0.54%    mov    0x10(%rsi),%r11d       ; get field $data
  7.96%    7.25%    cmp    0x18(%rsi),%r12d       ; null checking $abstractCoder
                    je     NULL_CHECK             ;   $abstractCoder is null, bail!
  2.05%    1.94%    mov    0xc(%r12,%r11,8),%eax  ; Coder0.abstractWork() body, arraylength
 13.19%   14.35%    add    $0x10,%rsp             ; epilogue and return
  0.33%    0.28%    pop    %rbp
  3.08%    3.91%    test   %eax,0x1642359a(%rip)
 11.75%   15.78%    retq
                NULL_CHECK:
                    <omitted>

In human words, it loads the data field, it does the null-check for abstractCoder, because we are about to call a virtual method on it, and then we have the inlined body of Coder0.abstractWork(). The cost of the null check is what we see as the difference against a plain static call.

How did we figure out the thing about null checking? Well, we already know %r12 is a compressed references base address, see the note above. We also know that %rsi is the reference to Data, and we are pulling the field at offset 0x18 from there. You may want to cross-check the object layout with JOL to make sure. Combining these two observations, we figure this compares the object reference to the base address, in other words, checks it for nullity.

One can wonder: what happens if we pulled a different coder from the field and not Coder0? Would we then proceed to the inlined body of Coder0, even though our original program is definitely saying otherwise? The answer is: you are looking at the result of a speculative optimization. VM knows there is only a single subclass of AbstractCoder that could possibly be here (knowledge coming from the Class Hierarchy Analysis, or CHA), and therefore we may skip checking the type. If our tests had a more complex hierarchy down the AbstractCoder, the VM would have to assume something else could enter here. In fact, we will see that once we start to deal with multiple types of coders.

Another puzzling thing about this: Java supports dynamically loaded classes! What happens if we compile the current code speculatively, but then some bastards receive another subclass of AbstractCoder over the network and load it? Everything breaks? The answer is: runtime can detect when the conditions under which the code was compiled are invalidated, and discard the compiled code. This is where runtimes shine: they control the code, they control the environment where the code is run, and therefore can optimize speculatively. Remember the note about the safepoints above? This is one of the things that help VM to quickly get the control back from the generated code when needed.

3.1.3. C2: dynamic_Interface_Ref

                  [Verified Entry Point]
  0.03%    0.02%    mov    %eax,-0x14000(%rsp)
  8.87%    7.69%    push   %rbp
  0.06%    0.03%    sub    $0x20,%rsp
  0.01%    0.02%    mov    0x10(%rsi),%r10d       ; get field $data
  9.20%   10.68%    mov    0x14(%rsi),%r8d        ; get field $coder
  0.01%    0.02%    mov    0x8(%r12,%r8,8),%r11d  ; get the class word for $coder
 14.04%   18.00%    cmp    $0xf80102f6,%r11d      ; compare it with Coder0
                    jne    TYPE_CHECK_FAILED      ;  not our coder, bail!
  7.50%   10.19%    mov    0xc(%r12,%r10,8),%eax  ; Coder0.work() body
  1.74%    1.89%    add    $0x20,%rsp             ; epilogue and return
                    pop    %rbp
  0.04%    0.04%    test   %eax,0x15fdfb4e(%rip)
 10.31%   13.05%    retq
                TYPE_CHECK_FAILED:
                    <omitted>

Notice here that the CHA is not playing on our side. Granted, we could have done the same analysis for interfaces, figured out the interface has only a single implementer, etc., but the reality is that real interfaces usually have lots of implementers, and so implementing and supporting this in the VM does not pass a reasonable cost-benefit analysis.

Therefore, the compiled code has to check the coder type, which explains the performance difference against the simple static case. It is also more expensive than null-checking the coder instance, since we need to pull and compare the actual classword.

How did we figure out the classword is being loaded? Remember the note above: we know that we are reading something at offset 0x8 from a compressed address in %r8. %r8 is a Coder reference, and so we pull the data at an 8-byte offset from the object itself. We know that is where the classword lies.

VM does the type checks by comparing the class word with a pre-compiled constant. That constant is actually the canonical address of native VM structure that mirrors the class. Since it’s canonical, and also because VM takes care of not moving it, we can compile the needed address right into the generated code, reducing these checks to just pointer comparison.
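Conceptually, the guard amounts to something like the following Java-level analogy; the generated code compares raw classwords rather than Class objects, and TypeGuard/guardedCall are hypothetical names:

```java
// Hypothetical analogy for the compiled type check: compare the receiver's
// exact class against the expected one, and only then run the inlined body.
public class TypeGuard {
    interface Coder { int work(byte[] data); }

    static class Coder0 implements Coder {
        public int work(byte[] data) { return data.length; }
    }

    static int guardedCall(Coder coder, byte[] data) {
        if (coder.getClass() == Coder0.class) {
            return data.length;  // "inlined" Coder0.work() body
        }
        return coder.work(data); // TYPE_CHECK_FAILED: the slow path
    }

    public static void main(String[] args) {
        System.out.println(guardedCall(new Coder0(), new byte[10])); // 10
    }
}
```

The real check is cheaper than getClass() comparison, precisely because the expected class is baked into the code as an immediate pointer constant.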

3.2. C1

As noted above, the generated code passes a few tiers of compilation before it reaches the heavily-optimized version. Sometimes what you transiently have before reaching the optimal performance also matters; this is why we should also look into C1, telling the VM to stop the tiered pipeline at level 1, with -XX:TieredStopAtLevel=1.

Ideally, we would want to limit the tiered compilation level only for a specific benchmark method, but it is simpler to run the entire code at a lower level. This is valid since we don’t compare the performance of C1 vs. C2, but merely study how a concrete compiler behaves.

Benchmark                  (count)  Mode  Cnt   Score   Error  Units
One.dynamic_Abstract_Ref     10000  avgt   50  41.595 ± 0.049  us/op
One.dynamic_Interface_Ref    10000  avgt   50  36.562 ± 0.058  us/op
One.static_Ref               10000  avgt   50  26.594 ± 0.051  us/op

We can see a slightly different distribution compared to the C2 case: the static case is still significantly faster, but now the abstract case is significantly slower than the interface one. What gives?

3.2.1. C1: static_Ref

                  [Verified Entry Point]
  8.79%    8.26%    mov    %eax,-0x14000(%rsp)
  2.14%    1.57%    push   %rbp
  0.28%    0.29%    sub    $0x30,%rsp
  9.23%    7.44%    mov    0x10(%rsi),%eax       ; get field $data
  0.77%    0.63%    shl    $0x3,%rax             ;  unpack it
  0.35%    0.37%    mov    0xc(%rax),%eax        ; Coder0.staticWork() body, arraylength
 13.30%   10.77%    add    $0x30,%rsp            ; epilogue and return
  1.16%    0.61%    pop    %rbp
  0.53%    0.32%    test   %eax,0x1650807f(%rip)
 13.50%   13.20%    retq

This case is again remarkably simple, because we know the call target exactly, as we did in the same C2 case.

The compressed reference packing/unpacking code may look like this as well. Notice it is almost the same as mov 0x10(%rsi),%eax; mov 0xc(%r12,%eax,8) when %r12 is 0. A shl $0x3, <anything> instruction is a marker of compressed reference decoding going on, if there are no other reasons to do the shift.

Minute differences in code generation between the compilers are not a surprise. What is surprising is that the C2-ish instruction choice is marginally slower than the C1-ish instruction choice:

C2: 29.575 ±(99.9%) 1.116 us/op
C1: 26.347 ±(99.9%) 0.178 us/op

Following up on this difference will be interesting, but outside the scope of this work. This observation is another example of the importance of very careful comparisons for the code coming through different toolchains. Read "Java vs. Scala: Divided We Fail" for a funnier example of the same effect.

3.2.2. C1: dynamic_Abstract_Ref

dynamic_Abstract_Ref():
                  [Verified Entry Point]
  5.22%    3.71%    mov    %eax,-0x14000(%rsp)
  1.12%    0.49%    push   %rbp
  4.73%    4.15%    sub    $0x30,%rsp
  0.20%    0.15%    mov    0x18(%rsi),%edi          ; get field $abstractCoder
  0.73%    0.03%    shl    $0x3,%rdi                ;  unpack it
  5.44%    4.78%    mov    0x10(%rsi),%edx          ; get field $data
  0.02%             shl    $0x3,%rdx                ;  unpack it
  0.16%    0.13%    mov    %rdi,%rsi
  0.72%    0.04%    mov    $0xffffffffffffffff,%rax
  4.89%    5.88%    callq  0x00007f7ae1046220       ; VIRTUAL CALL to abstractWork()
  6.01%    6.08%    add    $0x30,%rsp
  0.45%    0.40%    pop    %rbp
  0.28%    0.11%    test   %eax,0x17737229(%rip)
  4.91%    6.09%    retq

abstractWork():
                  [Verified Entry Point]
  0.39%    0.18%    mov    %eax,-0x14000(%rsp)
  5.39%    5.52%    push   %rbp
  0.20%    0.13%    sub    $0x30,%rsp
  0.53%    0.55%    mov    0xc(%rdx),%eax           ; arraylength
  6.10%    5.79%    add    $0x30,%rsp               ; epilogue and return
  0.18%    0.09%    pop    %rbp
  0.35%    0.46%    test   %eax,0x17736ee6(%rip)
  0.45%    0.32%    retq

First of all, you have to notice the abstractWork method is not inlined. This is because C1 is not able to figure out the static target for the call: the call target is not trivial, and the CHA is not working for abstract classes here.

One can untangle that as follows: run with -XX:+PrintCompilation -XX:+PrintInlining. For this method in a C1 compile we will see this "cryptic" message:

net.shipilev.one.One::do_Dynamic_Abstract_Ref (12 bytes)
   @ 8   net.shipilev.one.AbstractCoder::abstractWork (0 bytes)   no static binding

Now, if you search the HotSpot source for "no static binding", you will find a relevant part of C1 GraphBuilder, which implies the inlining of this method will happen if it’s static/private/dynamic, or otherwise final. Placing the final modifier over Coder0.abstractWork would not help here, because compiler tries to verify if AbstractCoder.abstractWork is final or not. CHA might have saved us here, but the block above explicitly rules it out (the reason for that is unknown to me).
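A hedged sketch of the distinction above: C1 checks finality on the declared receiver type, so a final modifier on the concrete override does not make the call site statically bindable (StaticBinding is a hypothetical class):

```java
// Hypothetical sketch: the call site sees AbstractCoder.abstractWork, which
// is abstract (not final), so C1 reports "no static binding" regardless of
// the final modifier on the concrete override.
abstract class AbstractCoder {
    public abstract int abstractWork(byte[] data);
}

class Coder0 extends AbstractCoder {
    @Override
    public final int abstractWork(byte[] data) { // final here does not help
        return data.length;
    }
}

public class StaticBinding {
    static int call(AbstractCoder c, byte[] data) {
        return c.abstractWork(data); // invokevirtual against AbstractCoder
    }

    public static void main(String[] args) {
        System.out.println(call(new Coder0(), new byte[10])); // 10
    }
}
```

CHA could, in principle, prove the target unique here, which is exactly what C2 exploits and C1 explicitly rules out.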

Somebody asked me what that mov $0xffffffffffffffff,%rax is doing before the call. Obviously, that’s -1 in two’s-complement form. My speculation is that -1 signifies the empty inline cache to the C1 method dispatcher. Allegedly emitted here.

3.2.3. C1: dynamic_Interface_Ref

                  [Verified Entry Point]
  7.63%    5.23%    mov    %eax,-0x14000(%rsp)
  0.09%    0.22%    push   %rbp
  7.13%   10.31%    sub    $0x30,%rsp
  0.35%    0.10%    mov    0x14(%rsi),%eax      ; get field $coder
  0.05%    0.16%    shl    $0x3,%rax            ;  unpack it
  6.84%   10.14%    mov    0x10(%rsi),%esi      ; get field $data
  0.16%    0.41%    shl    $0x3,%rsi            ;  unpack it
  0.13%    0.11%    cmp    $0x0,%rax            ; null-check
                    je     WORK                 ;  bail to implicit null-check handling

  0.09%    0.06%    mov    $0x7c00817b0,%rcx    ; load Coder0 metadata
  6.50%    7.65%    mov    0x8(%rax),%edx       ; load classword for $coder
  0.31%    0.38%    shl    $0x3,%rdx            ;  unpack it
  0.49%    0.61%    cmp    0x38(%rdx),%rcx      ; check if $coder is actually Coder0
                    jne    TYPE_CHECK_FAILED    ;  not? bail.
 27.33%   30.35%    jmpq   WORK                 ; GO HOME COMPILER, YOU ARE DRUNK

                 WORK:
  0.31%    0.25%    mov    %rax,%rdi
  0.07%    0.04%    cmp    (%rax),%rax          ; implicit null check
  0.05%             mov    0xc(%rsi),%eax       ; Coder0.work() body, arraylength
  7.03%    7.82%    add    $0x30,%rsp           ; epilog and return
  0.19%    0.30%    pop    %rbp
  0.08%    0.06%    test   %eax,0x1639b908(%rip)
  0.15%    0.08%    retq

                TYPE_CHECK_FAILED:
                    <omitted>

The interface call is inlined more or less nicely, but there is a significant dance for type checking. The largest part of this method seems to wait on cmp 0x38(%rdx),%rcx and the associated jumps, although it is hard to tell what the problem is exactly: it may be a cache miss on getting 0x38(%rdx) (though unlikely in a nanobenchmark), or some weird pipeline effect with two jumps following each other. Either way, the jmpq there seems suspicious, since we might as well fall through. This could be fixed with another pass of a peephole optimizer that could at least nop out the jmpq, but remember we are talking about C1, whose goal is to produce the code fast, even at the expense of its quality.

How did we figure cmp 0x38(%rdx),%rcx is a type check? First, we know that before that instruction the %rdx register contained the classword, because we loaded it from offset 0x8 from the object itself, see the notes above. We also know %rdx is the address of some native data structure in the VM. Now, figuring out the layout of that native structure is somewhat hard. As a bullet-proof solution, you would employ the Serviceability Agent for a task like this. For our study, it is enough to figure out that we compare the actual class with an expected class, hence this is a type check.

3.3. Interpreter

The interpreter tests are interesting because you don’t expect an optimizing compiler to come and save you. That’s certainly true, and if you run the tests with -Xint, you will see something like this:

Benchmark                  (count)  Mode  Cnt     Score    Error  Units
One.dynamic_Abstract_Ref     10000  avgt   50  1149.768 ±  7.131  us/op
One.dynamic_Interface_Ref    10000  avgt   50  1154.198 ± 11.030  us/op
One.static_Ref               10000  avgt   50  1137.694 ±  7.405  us/op

It would seem that the interpreter is so slow that the cost of the call itself drowns there. However, if you profile the interpreter with something like JMH perfasm, you will see that we spend most of the time doing stack banging. Working around that with -Xint -XX:StackShadowPages=1 gives a nice performance boost:

Benchmark                  (count)  Mode  Cnt    Score   Error  Units
One.dynamic_Abstract_Ref     10000  avgt   50  369.172 ± 2.024  us/op
One.dynamic_Interface_Ref    10000  avgt   50  388.112 ± 3.103  us/op
One.static_Ref               10000  avgt   50  344.355 ± 3.402  us/op

Meddling with -XX:StackShadowPages may undermine the JVM’s ability to deal with stack overflow errors. Do not use this blindly in production. Read more about stack overflows here. We are reducing the number of shadow pages here to magnify the effect we are after.

As in the C2 case, the static case is faster, and the interface case is slower than the abstract one. Let us look into…​ the generated code! We won’t spend much time here, but just outline a few key things.

As unusual as it sounds, our interpreter generates code. For each bytecode instruction it generates one or more platform-specific stubs that together implement the abstract bytecode-executing machine. With Futamura projections, one can easily argue this is actually a template compiler. The established term for this implementation is "template interpreter" though, because it does interpret the bytecode in a (weird, but) straightforward way. The debate over what to call such an implementation further outlines that the line separating an interpreter from a compiler is very thin, if it exists at all.

The assembly for the stubs is "fun" reading, and we will not go there. Instead, we will do a differential analysis, since the stubs are roughly the same for all the cases, and only their time distributions differ.

This is static_Ref:

....[Hottest Methods (after inlining)]................................
 15.33%   21.46%  <stub: invoke return entry points>
 13.89%    8.23%  <stub: method entry point (kind = zerolocals)>
 12.77%   14.76%  <stub: invokevirtual  182 invokevirtual>
 10.07%    9.69%  <stub: fast_aaccess_0  221 fast_aaccess_0>
  8.42%   13.37%  <stub: ireturn  172 ireturn>
  7.10%    5.56%  <stub: throw exception entrypoints>
  5.02%    2.63%  <stub: pop  87 pop>
  4.14%    2.90%  <stub: iload_3  29 iload_3>
  3.95%    6.49%  <stub: invokestatic  184 invokestatic>
  3.61%    1.08%  <stub: if_icmpge  162 if_icmpge>
  2.39%    2.36%  <kernel>

We have five large stubs of interest:

  1. fast_aaccess_0, for reading the array elements; remember, our infrastructure code pulls the targets off an array

  2. invokevirtual, for invoking the virtual payload calls; with no inlining we still have to make a call

  3. method entry point, executed before entering the payload method body

  4. invoke return, executed before leaving the payload method body

  5. invokestatic, executing our payload; this stub is not very hot, because we spend most of the time dealing with larger problems in the infrastructure code itself.

For comparison, this is dynamic_Abstract_Ref:

....[Hottest Methods (after inlining)]................................
 18.98%   16.18%  <stub: invokevirtual  182 invokevirtual>
 16.82%    8.74%  <stub: method entry point (kind = zerolocals)>
 12.54%   20.40%  <stub: throw exception entrypoints>
 12.04%   10.56%  <stub: fast_aaccess_0  221 fast_aaccess_0>
  7.46%   12.45%  <stub: invoke return entry points>
  6.44%    6.24%  <stub: ireturn  172 ireturn>
  4.36%    6.94%  <stub: pop  87 pop>
  3.76%    2.69%  <stub: iload_3  29 iload_3>
  3.03%    0.60%  <stub: if_icmpge  162 if_icmpge>
  2.32%    2.27%  <kernel>

It is different from the previous one in the sense that invokestatic is gone, and replaced by invokevirtual for our abstract call, and we also have the throw exception stub, which is here to handle possible NPEs. This difference explains why static_Ref is faster — for the same reason it is faster in compiled versions. If the call receiver is known, everything gets easier.

dynamic_Interface_Ref is roughly the same, but with the invokeinterface stub:

....[Hottest Methods (after inlining)]................................
 17.35%   18.01%  <stub: invokeinterface  185 invokeinterface>
 12.93%    8.52%  <stub: method entry point (kind = zerolocals)>
 11.66%   11.30%  <stub: fast_aaccess_0  221 fast_aaccess_0>
 11.62%    9.44%  <stub: throw exception entrypoints>
  9.87%   11.01%  <stub: invokevirtual  182 invokevirtual>
  6.74%   11.34%  <stub: invoke return entry points>
  6.38%    6.33%  <stub: ireturn  172 ireturn>
  3.60%    2.10%  <stub: if_icmpge  162 if_icmpge>
  3.43%    5.79%  <stub: pop  87 pop>
  2.62%    1.81%  <stub: aload_1  43 aload_1>
  2.49%    2.40%  <stub: iload_3  29 iload_3>
  2.23%    2.24%  <kernel>

3.4. Discussion I

Looking at the case of a single coder, we can already spot a few things:

  1. If you know a call receiver exactly at development/compile time, it is a good idea to call it statically. This practice actually coincides with good engineering: if the method behavior does not depend on the receiver state, it might as well be static to begin with.

  2. C2 will try to guess the receiver based on profile, but it has to do the defensive type checks and/or null checks, as mandated by the language specification. While this is more than enough in a vast majority of cases, the inherent costs of these mandatory checks are sometimes visible on a very high magnification.

  3. C1 is sloppy when it comes to exploiting the static type information, as well as profile. This may affect warmup times and time-to-performance. It’s an open question if you can implement the C2-like smart moves, without compromising the compilation time.

To reiterate, the method invocation performance is not a concern most of the time, and the VM optimizations are actually closing the gap very tightly. We will see what happens when they can’t, as we go on.

4. Two Types

Now that we have learned some basic stuff about single type calls, let’s see how runtimes handle the calls with different targets. The simplest way to achieve that in a benchmark is to modify our single coder benchmark to segregate two Coders.

@Param({"0.0", "0.1", "0.5", "0.9", "1.0"})
private double bias;

private Data[] datas;

@Setup
public void setup() {
    Random r = new Random(12345);

    List<Data> ts = new ArrayList<Data>();
    for (int c = 0; c < count; c++) {
         byte[] contents = new byte[10];
         r.nextBytes(contents);
         byte id = (byte) (c < (bias * count) ? 1 : 0);
         ts.add(new Data(id, contents));
    }
    Collections.shuffle(ts, r);
    datas = ts.toArray(new Data[0]);
}

This further complicates benchmarking, because now we have to decide what distribution of coders to measure. Should it be 50/50? Should it be 1/42? To cover the ground, we introduce the bias parameter, which says what share of the distribution Coder1 takes (in the setup above, the first bias * count elements get id = 1), and measure the benchmarks at the 0.0, 0.1, 0.5, 0.9, and 1.0 ratios.
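To make the bias semantics concrete, here is a standalone sketch (a toy mirroring the setup code above, not part of the benchmark itself): the first bias * count elements get id = 1, the rest get id = 0, and shuffling only randomizes the order, not the ratio.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class BiasSketch {
    // Mirrors setup(): first bias*count elements are Coder1 (id = 1),
    // the rest are Coder0 (id = 0), then the order is shuffled.
    static int[] makeIds(int count, double bias, long seed) {
        List<Integer> ids = new ArrayList<>();
        for (int c = 0; c < count; c++) {
            ids.add(c < (bias * count) ? 1 : 0);
        }
        Collections.shuffle(ids, new Random(seed));
        int[] result = new int[count];
        for (int i = 0; i < count; i++) {
            result[i] = ids.get(i);
        }
        return result;
    }

    // Fraction of Coder1 receivers actually present in the array.
    static double observedBias(int[] ids) {
        int ones = 0;
        for (int id : ids) {
            ones += id;
        }
        return (double) ones / ids.length;
    }

    public static void main(String[] args) {
        // Shuffling randomizes the order, but the ratio stays at the bias.
        System.out.println(observedBias(makeIds(10000, 0.1, 12345L)));
    }
}
```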

Measuring only a single bias is a benchmarking error: it can lock the benchmark into some specific corner condition, which might not be what the real code experiences. You should try to poke at different settings to see how the benchmark responds. See more discussion in "Nanotrusting the Nanotime". Our first batch of experiments did 0.0, 0.5, and 1.0, but we figured all these cases are degenerate in one way or the other, see below. That’s why we added the somewhat more realistic 0.1 and 0.9.

4.1. Reference Tests

Again, let us "eat the elephant one bite at a time", and start from the tests that we are already familiar with, the Ref family. static_Ref is not applicable here anymore, because there is no way to select two coder implementations with static methods and nothing else (we will see how auxiliary selectors perform later). That leaves dynamic_Interface_Ref and dynamic_Abstract_Ref.

4.1.1. C2: dynamic_…​_Ref

If we run these tests with C2, we will see:

Benchmark                  (bias)  (count)  Mode  Cnt    Score   Error  Units
Two.dynamic_Interface_Ref     0.0    10000  avgt   50   52.701 ± 0.470  us/op
Two.dynamic_Interface_Ref     0.1    10000  avgt   50   57.760 ± 0.542  us/op
Two.dynamic_Interface_Ref     0.5    10000  avgt   50   94.744 ± 0.035  us/op
Two.dynamic_Interface_Ref     0.9    10000  avgt   50   59.046 ± 1.595  us/op
Two.dynamic_Interface_Ref     1.0    10000  avgt   50   50.857 ± 0.252  us/op

Two.dynamic_Abstract_Ref      0.0    10000  avgt   50   54.419 ± 1.733  us/op
Two.dynamic_Abstract_Ref      0.1    10000  avgt   50   58.244 ± 0.652  us/op
Two.dynamic_Abstract_Ref      0.5    10000  avgt   50   95.765 ± 0.180  us/op
Two.dynamic_Abstract_Ref      0.9    10000  avgt   50   58.618 ± 0.843  us/op
Two.dynamic_Abstract_Ref      1.0    10000  avgt   50   52.412 ± 0.185  us/op

Notice a few things about these results already. First, the performance depends on the bias; we can already speculate why, but we will dive into the exact reasons a bit later. Second, the performance difference is symmetrical around 1/2 bias, which already gives you a few hints about what’s happening. Third, the abstract case and the interface case seem to perform the same.

All right, let us disassemble some stuff! Take dynamic_Interface_Ref with bias = 0.0, that is, only Coder0 instances are present in the code:

                  [Verified Entry Point]
  0.03%    0.07%    mov    %eax,-0x14000(%rsp)
  8.36%    7.19%    push   %rbp
  0.03%    0.03%    sub    $0x20,%rsp
                    mov    0x10(%rsi),%r10d       ; get field $data
  8.83%   10.53%    mov    0x14(%rsi),%r8d        ; get field $coder
  0.05%    0.03%    mov    0x8(%r12,%r8,8),%r11d  ; load $coder.<classword>
 12.54%   16.32%    cmp    $0xf801040a,%r11d      ; check $coder is Coder0
                    jne    SLOW_PATH              ;   not? bail
  6.87%    9.52%    mov    0xc(%r12,%r10,8),%eax  ; Coder0.work() body
  2.72%    2.52%    add    $0x20,%rsp             ; epilogue and return
           0.02%    pop    %rbp
  0.03%    0.02%    test   %eax,0x171b1d0e(%rip)
 10.09%   11.50%    retq

                  SLOW_PATH:
                    mov    $0xffffffde,%esi       ; handle the very rare case
                    mov    %r8d,%ebp
                    mov    %r10d,(%rsp)
                    callq  0x00007f4d310051a0     ; call into runtime <uncommon trap>

Notice anything? The code looks similar to what we had before, but this time we check the actual type of the coder, and call into the runtime if we see anything other than Coder0. The VM was actually smart enough to figure out that the only type here is Coder0.

It is sometimes enlightening to view the type profile per call site. It’s doable with -XX:+PrintCompilation -XX:+PrintInlining (plus the additional -XX:+TraceTypeProfile on older VMs). For this test, it will indeed say that only one type (Coder0) was observed:

 net.shipilev.two.Data::do_Dynamic_Interface_Ref (14 bytes)
   @ 8   net.shipilev.two.Coder0::work (3 bytes)   inline (hot)
     \-> TypeProfile (6391/6391 counts) = net/shipilev/two/Coder0

What’s an "uncommon trap"? That’s another part of the VM-code interface. When the compiler knows some branch/case is unlikely, it can emit a simple call back to the VM instead of generating a whole lot of never-to-be-used code. This can greatly cut the compile times, as well as the generated code footprint, and provide denser hot paths, e.g. in loops. In this case, stepping on that uncommon trap may also mean our profile information about the coder being only Coder0 is outdated, and we need to re-profile and recompile.
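The guard-and-trap shape of the disassembly above can be sketched in plain Java (a toy model with hypothetical Coder/Coder0 types, not the actual generated code): compare the receiver class against the single profiled class, run the inlined body on the fast path, and fall back otherwise, where the real code would hit the uncommon trap.

```java
public class MonoGuard {
    public interface Coder { int work(byte[] data); }

    public static final class Coder0 implements Coder {
        public int work(byte[] data) { return data.length; }
    }

    // Fast path: the single profiled receiver class, with work() inlined.
    // Slow path: a stand-in for the <uncommon trap> call into the runtime.
    public static int guardedCall(Coder coder, byte[] data) {
        if (coder.getClass() == Coder0.class) {
            return data.length;      // inlined Coder0.work() body
        }
        return coder.work(data);     // rare case: re-profile, recompile
    }

    public static void main(String[] args) {
        System.out.println(guardedCall(new Coder0(), new byte[10]));
    }
}
```

In the real compiled code, the class check is a single cmp against the classword, and the slow path does not dispatch at all: it transfers control back to the VM.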

dynamic_Interface_Ref with bias = 0.1 is more interesting:

                  [Verified Entry Point]
  0.02%    0.12%    mov    %eax,-0x14000(%rsp)
  7.98%    7.81%    push   %rbp
  0.08%    0.07%    sub    $0x20,%rsp
  0.64%    0.69%    mov    0x10(%rsi),%r10d       ; get field $data
  7.62%    8.67%    mov    0x14(%rsi),%r8d        ; get field $coder
  0.13%    0.03%    mov    0x8(%r12,%r8,8),%r11d  ; load $coder.<classword>
 13.64%   19.45%    cmp    $0xf801040a,%r11d      ; check $coder is Coder0
                    jne    CODER_1                ;   not? jump further
  6.36%    8.34%    mov    0xc(%r12,%r10,8),%eax  ; Coder0.work() body

                 EPILOG:
  1.21%    1.07%    add    $0x20,%rsp             ; epilog and return
  0.03%             pop    %rbp
  0.51%    0.62%    test   %eax,0x1874c24e(%rip)
  8.85%   11.10%    retq

                 CODER_1:
  1.18%    1.40%    cmp    $0xf8012585,%r11d      ; check if $coder is Coder1
                    jne    SLOW_PATH              ;   not? jump further
  0.52%    0.58%    mov    0xc(%r12,%r10,8),%eax  ; Coder1.work() body
  0.02%             jmp    EPILOG                 ; jump back to epilog

                 SLOW_PATH:
                    mov    $0xffffffc6,%esi       ; handle the very rare case
                    mov    %r8d,%ebp
                    mov    %r10d,(%rsp)
                    callq  0x00007f2b2d0051a0     ; call into runtime <uncommon trap>

Now we actually have the two real coder implementations. Notice, however, that the codepath leading to Coder1 is pushed out of the straight path, and the more frequent case of Coder0 takes its place. If we were to run with bias = 0.9, Coder0 and Coder1 would swap places here, because the profile says so, and the code layouter consults it.

In other words, compilers use frequency-based basic block layout, which puts the basic blocks on a straight path based on the transition frequencies between them. That has a handy effect for machines: the most frequent code is linear, which plays nicely with instruction caches and decoders. It also has a handy effect on humans: if you look at a compare-and-branch sequence in the generated code, the unlikely branch will probably be fanned out, and the likely branch will fall through.
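The bias = 0.1 code shape above can likewise be sketched in plain Java (again a toy with hypothetical types, not the generated code): the frequent class is tested first, the second class is tested on the out-of-line path, and everything else falls back to where the uncommon trap would be.

```java
public class BiGuard {
    public interface Coder { int work(byte[] data); }

    public static final class Coder0 implements Coder {
        public int work(byte[] data) { return data.length; }
    }

    public static final class Coder1 implements Coder {
        public int work(byte[] data) { return data.length; }
    }

    public static int guardedCall(Coder coder, byte[] data) {
        if (coder.getClass() == Coder0.class) {
            return data.length;      // inlined Coder0.work(), frequent path
        }
        if (coder.getClass() == Coder1.class) {
            return data.length;      // inlined Coder1.work(), infrequent path
        }
        return coder.work(data);     // stand-in for the <uncommon trap>
    }

    public static void main(String[] args) {
        System.out.println(guardedCall(new Coder0(), new byte[10]));
        System.out.println(guardedCall(new Coder1(), new byte[10]));
    }
}
```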

Same test with bias = 0.5:

                  [Verified Entry Point]
  1.80%    2.27%    mov    %eax,-0x14000(%rsp)
  5.28%    5.51%    push   %rbp
  0.47%    0.35%    sub    $0x20,%rsp
  3.00%    3.95%    mov    0x10(%rsi),%r10d       ; get field $data
  4.86%    6.28%    mov    0x14(%rsi),%r8d        ; get field $coder
  0.10%    0.10%    mov    0x8(%r12,%r8,8),%r11d  ; load $coder.<classword>
 13.92%   17.49%    cmp    $0xf8012585,%r11d      ; check if $coder is Coder1
                    je     CODER_1                ;   jump further
  4.96%    6.58%    cmp    $0xf801040a,%r11d      ; check if $coder is Coder0
                    jne    SLOW_PATH              ;   not? jump out
  2.70%    2.93%    mov    0xc(%r12,%r10,8),%eax  ; Coder0.work() body
  1.42%    0.93%    jmp    EPILOG                 ; jump to epilog

                  CODER_1:
  3.78%    5.48%    mov    0xc(%r12,%r10,8),%eax  ; Coder1.work() body

                  EPILOG:
  2.20%    1.77%    add    $0x20,%rsp             ; epilog and return
  2.11%    1.88%    pop    %rbp
  2.21%    1.68%    test   %eax,0x15e9c93e(%rip)
  4.41%    4.23%    retq

                  SLOW_PATH:
                    mov    $0xffffffc6,%esi
                    mov    %r8d,%ebp
                    mov    %r10d,(%rsp)
                    callq  0x00007f8737f6b1a0     ; call into runtime <uncommon trap>

Now, because the runtime profile affects the block layout, this is arguably a degenerate case, where a tiny swing towards either Coder0 or Coder1 can flip the layout decision, potentially leading to a great run-to-run variance. From a performance standpoint, however, such a distribution means lots of mispredicted branches, which hurts performance on modern CPUs.

For example, if we run under perf (JMH conveniently provides the support with -prof perf), we will see progressively worse branch miss ratios. An 8% branch miss ratio is rather bad, and the hit to IPC (Instructions Per Cycle) is enormous.

bias=0.0:

   197,299,178,798 cycles           #    3.977 GHz
   323,821,144,134 instructions     #    1.64  insns per cycle
    47,692,252,113 branches         #  961.278 M/sec
        16,565,723 branch-misses    #    0.03% of all branches

bias=0.1:

   197,078,370,315 cycles           #    3.976 GHz
   295,840,052,712 instructions     #    1.50  insns per cycle
    44,917,582,623 branches         #  906.143 M/sec
       836,907,280 branch-misses    #    1.86% of all branches

bias=0.5:

   197,541,017,422 cycles           #    3.977 GHz
   183,698,062,858 instructions     #    0.93  insns per cycle
    31,674,184,694 branches         #  637.693 M/sec
     2,619,451,036 branch-misses    #    8.27% of all branches

One might wonder: why don’t we sort the datas array before doing the experiments, wouldn’t that help? Sure, it would help branch prediction in these tests, but this is bad benchmarking, for three reasons. First, this "lucky" coder distribution is unlikely in real life. We could live with that, as long as we are doing synthetic benchmarks anyway. Second, we need to take the hardware effects into account, at least in order to estimate how the effect we are after compares with something we can’t really control.

Third, if you have a large sorted array, then chances are you will trip the profiler half-way through processing the array, and therefore skew the type profile. Imagine an array of 10K elements, the first half populated with the elements of one type, and the other half with the elements of another. If the compiler trips after traversing the first 7.5K, then the "observed" type profile would be 66.(6)% of type 1, and 33.(3)% of type 2, obliterating the experimental setup.
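The skew is easy to model with a toy calculation (illustrative numbers only): build a sorted 10K array with a 50/50 type split, stop "profiling" after 7.5K elements, and look at the observed ratio.

```java
public class ProfileSkew {
    // Percentage of type-0 receivers among the first `seen` elements.
    static double observedType0Percent(int[] ids, int seen) {
        int type0 = 0;
        for (int i = 0; i < seen; i++) {
            if (ids[i] == 0) type0++;
        }
        return 100.0 * type0 / seen;
    }

    public static void main(String[] args) {
        int count = 10000;
        int[] sorted = new int[count];
        for (int i = 0; i < count; i++) {
            sorted[i] = (i < count / 2) ? 0 : 1; // first half type 0, rest type 1
        }
        // A profiler stopping at 7500 elements "observes" 66.(6)% of type 0,
        // even though the real split is 50/50.
        System.out.println(observedType0Percent(sorted, 7500));
    }
}
```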

dynamic_Abstract_Ref behaves almost exactly the same, and generates almost the same code, which explains why the performance is similar to dynamic_Interface_Ref.

4.1.2. C1: dynamic_…​_Ref

Benchmark                  (bias)  (count)  Mode  Cnt    Score   Error  Units
Two.dynamic_Interface_Ref     0.0    10000  avgt   50   59.592 ± 0.379  us/op
Two.dynamic_Interface_Ref     0.1    10000  avgt   50   88.052 ± 0.352  us/op
Two.dynamic_Interface_Ref     0.5    10000  avgt   50  136.241 ± 0.426  us/op
Two.dynamic_Interface_Ref     0.9    10000  avgt   50   89.264 ± 1.142  us/op
Two.dynamic_Interface_Ref     1.0    10000  avgt   50   59.285 ± 1.478  us/op

Two.dynamic_Abstract_Ref      0.0    10000  avgt   50   57.449 ± 2.096  us/op
Two.dynamic_Abstract_Ref      0.1    10000  avgt   50   74.185 ± 1.606  us/op
Two.dynamic_Abstract_Ref      0.5    10000  avgt   50  120.458 ± 0.397  us/op
Two.dynamic_Abstract_Ref      0.9    10000  avgt   50   75.714 ± 1.695  us/op
Two.dynamic_Abstract_Ref      1.0    10000  avgt   50   58.020 ± 1.606  us/op

C1 exhibits the same symmetry around bias = 0.5, but now the Interface and Abstract cases are clearly different. If you disassemble both, you will see that in both cases the calls are not inlined, and look similar to the C1 version in the single type case. However, there are also additional pieces of hot code, attributed to the VM stubs that do the actual vtable (virtual call table) and itable (interface call table) dispatches.

This is what the vtable stub looks like:

                  Decoding VtableStub vtbl[5]@12
  1.68%    1.32%    mov    0x8(%rsi),%eax    ; unpack class word
  3.50%    3.34%    shl    $0x3,%rax
  1.69%    1.84%    mov    0x1e0(%rax),%rbx  ; select vtable
  9.21%    8.17%    jmpq   *0x40(%rbx)       ; select vtable[0x40], and jump

This is what the itable stub looks like:

                  Decoding VtableStub itbl[0]@12
  1.86%    0.90%    mov    0x8(%rsi),%r10d          ; unpack classword
  1.80%    1.73%    shl    $0x3,%r10
  1.07%    0.71%    mov    0x120(%r10),%r11d        ; select interface table
  7.27%    5.22%    lea    0x1b8(%r10,%r11,8),%r11  ; ...and lookup the interface below
  4.09%    4.13%    lea    (%r10),%r10
  0.03%    0.06%    mov    (%r11),%rbx
  7.98%   12.49%    cmp    %rbx,%rax
                    test   %rbx,%rbx
                    je     0x00007f9c4d117cca
                    mov    (%r11),%rbx
                    cmp    %rbx,%rax
                    jne    0x00007f9c4d117caa
  2.55%    2.60%    mov    0x8(%r11),%r11d
  0.98%    1.48%    mov    (%r10,%r11,1),%rbx
  8.43%   11.33%    jmpq   *0x40(%rbx)             ; select target method, and jump

Note that the interface stub is generally larger, because dispatching an interface call takes much more work, and this explains the performance difference between the dynamic_Interface_Ref and dynamic_Abstract_Ref cases.

Interface calls are interesting in the way you have to deal with the dispatch. Virtual calls are easy in the sense that we can construct the vtables for the entire class hierarchy up front, and have just a single indirection off the vtable to resolve the call target. Interface calls, however, are more complicated: at the call site we only know the interface klass, not the implementation class, and therefore we can’t predict at which offset in the concrete class vtable to look for the interface call target.

Therefore, we have to consult the interface table for the given object reference, figure out the vtable for that particular interface, and then call off the resolved vtable. The interface table lies at a generic location within each class, and therefore the itable (interface table) stub can just look it up there. You may learn more about the runtime representation of vtables and itables in the HotSpot source.
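The difference can be modeled with plain data structures (a toy emulation, not how HotSpot actually lays out its metadata): virtual dispatch is a single indexed load from the class’s vtable, while interface dispatch first has to locate the right interface’s table, and only then index into it.

```java
import java.util.HashMap;
import java.util.Map;

public class DispatchSketch {
    public interface Method { int invoke(byte[] data); }

    // Virtual dispatch: the slot index is fixed across the whole hierarchy,
    // so the target is one indexed load away.
    public static int vcall(Method[] vtable, int slot, byte[] data) {
        return vtable[slot].invoke(data);
    }

    // Interface dispatch: we only know the interface at the call site, so we
    // first look up that interface's table in the class's itable, and then
    // index into it - an extra lookup step compared to vcall.
    public static int icall(Map<String, Method[]> itables, String iface,
                            int slot, byte[] data) {
        Method[] table = itables.get(iface); // the extra itable search
        return table[slot].invoke(data);
    }

    public static void main(String[] args) {
        Method work = data -> data.length;   // stands in for Coder0.work()
        Method[] vtable = { work };

        Map<String, Method[]> itables = new HashMap<>();
        itables.put("Coder", vtable);

        byte[] data = new byte[10];
        System.out.println(vcall(vtable, 0, data));
        System.out.println(icall(itables, "Coder", 0, data));
    }
}
```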

4.1.3. Interpreter

Benchmark                  (bias)  (count)  Mode  Cnt    Score    Error  Units
Two.dynamic_Interface_Ref     0.0    10000  avgt   50  477.109 ±  9.907  us/op
Two.dynamic_Interface_Ref     0.1    10000  avgt   50  469.529 ±  2.929  us/op
Two.dynamic_Interface_Ref     0.5    10000  avgt   50  482.693 ±  5.955  us/op
Two.dynamic_Interface_Ref     0.9    10000  avgt   50  473.310 ±  4.758  us/op
Two.dynamic_Interface_Ref     1.0    10000  avgt   50  480.534 ±  4.382  us/op

Two.dynamic_Abstract_Ref      0.0    10000  avgt   50  454.018 ±  1.001  us/op
Two.dynamic_Abstract_Ref      0.1    10000  avgt   50  459.251 ±  0.366  us/op
Two.dynamic_Abstract_Ref      0.5    10000  avgt   50  458.424 ±  2.020  us/op
Two.dynamic_Abstract_Ref      0.9    10000  avgt   50  464.937 ±  7.386  us/op
Two.dynamic_Abstract_Ref      1.0    10000  avgt   50  454.475 ±  1.075  us/op

The difference between invokevirtual and invokeinterface dispatch also explains why the interpreter tests for dynamic_Abstract_Ref are faster than dynamic_Interface_Ref. We will not dive further into this; interested readers are invited to profile these scenarios, if they feel these are important.

4.1.4. Discussion II.1

We have seen that the cost of dispatching across two types varies greatly between the compilers and interpreters.

  1. The most common observable effect is the difference between invokevirtual and invokeinterface. While good engineering practice tells us to use interfaces where possible (especially functional interfaces for lambdas), plain virtual calls can be much cheaper, since they don’t involve the complicated call target lookups. This is not an immediate concern for most of the code that gets compiled with C2, though. But both C1 and the interpreter demonstrate a measurable difference between calling via invokevirtual and invokeinterface.

  2. C2 does an interesting profile-guided optimization based on the observed type profile. If there is only a single receiver type (that is, the call site is monomorphic), it can simply check for the predicted type, and inline the target directly. The same optimization can and will be applied if there are two receiver types observed (that is, the call site is bimorphic), at the cost of two branches.

  3. Even with aggressive bimorphic inlining, the hardware has problems with the mispredicted branches that will inevitably be present when the types are mixed. However, this is still better than going through the vtable/itable stubs and experiencing the same mispredict there, as evidenced by comparing C2 vs. C1.

4.2. Cheating The Runtime

All right, now that we know the problems with generic dispatch over virtual and interface calls, can we cheat? We already know that static calls are perfectly resolved by most compilers, and even by the interpreter. The caveat is that we can’t dispatch over multiple types with a static method alone.

Can we implement the auxiliary selectors, though? That is, can we store some field in the Data class itself, and dispatch across static calls ourselves? Let’s implement these:

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Static_ID_switch() {
    switch (id) {
        case 0: return Coder0.staticWork(data);
        case 1: return Coder1.staticWork(data);
        default:
            throw new IllegalStateException();
    }
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Static_ID_ifElse() {
    if (id == 0) {
        return Coder0.staticWork(data);
    } else if (id == 1) {
        return Coder1.staticWork(data);
    } else {
        throw new IllegalStateException();
    }
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Static_Bool_ifElse() {
    if (isCoder0) {
        return Coder0.staticWork(data);
    } else {
        return Coder1.staticWork(data);
    }
}

…​and compare with the best known dynamic way to dispatch, dynamic_Abstract_Ref.

4.2.1. C2: static_…​

Benchmark                 (bias)  (count)  Mode  Cnt    Score    Error  Units

# Reference
Two.dynamic_Abstract_Ref     0.0    10000  avgt   50   54.419 ± 1.733  us/op
Two.dynamic_Abstract_Ref     0.1    10000  avgt   50   58.244 ± 0.652  us/op
Two.dynamic_Abstract_Ref     0.5    10000  avgt   50   95.765 ± 0.180  us/op
Two.dynamic_Abstract_Ref     0.9    10000  avgt   50   58.618 ± 0.843  us/op
Two.dynamic_Abstract_Ref     1.0    10000  avgt   50   52.412 ± 0.185  us/op

# if (coder0) Coder0.staticWork(data) else Coder1.staticWork(data);
Two.static_Bool_ifElse       0.0    10000  avgt   50   50.206 ± 0.730  us/op
Two.static_Bool_ifElse       0.1    10000  avgt   50   55.036 ± 0.686  us/op
Two.static_Bool_ifElse       0.5    10000  avgt   50   88.445 ± 0.508  us/op
Two.static_Bool_ifElse       0.9    10000  avgt   50   54.730 ± 0.733  us/op
Two.static_Bool_ifElse       1.0    10000  avgt   50   48.966 ± 1.338  us/op

# if (id == 0) Coder0.staticWork(data) else if (id == 1) Coder1.staticWork(data) ...
Two.static_ID_ifElse         0.0    10000  avgt   50   48.506 ± 0.274  us/op
Two.static_ID_ifElse         0.1    10000  avgt   50   56.479 ± 1.092  us/op
Two.static_ID_ifElse         0.5    10000  avgt   50   88.368 ± 0.205  us/op
Two.static_ID_ifElse         0.9    10000  avgt   50   55.940 ± 0.935  us/op
Two.static_ID_ifElse         1.0    10000  avgt   50   48.090 ± 0.406  us/op

# switch (id) case 0: Coder0.staticWork(data); break; case 1: Coder1.staticWork(data); break;
Two.static_ID_switch         0.0    10000  avgt   50   50.641 ± 2.770  us/op
Two.static_ID_switch         0.1    10000  avgt   50   57.856 ± 1.156  us/op
Two.static_ID_switch         0.5    10000  avgt   50   89.138 ± 0.265  us/op
Two.static_ID_switch         0.9    10000  avgt   50   56.312 ± 1.045  us/op
Two.static_ID_switch         1.0    10000  avgt   50   48.931 ± 1.009  us/op

As we can see, the cheating helps somewhat, but why? We have seen that monomorphic and bimorphic inlining is actually quite good; how can we win against that? Let’s disassemble stuff.

This is static_Bool_ifElse and bias = 0.0:

                  [Verified Entry Point]
  1.24%    0.73%    mov    %eax,-0x14000(%rsp)
  3.01%    2.43%    push   %rbp
                    sub    $0x20,%rsp
  1.56%    0.76%    movzbl 0xc(%rsi),%r11d        ; get field $isCoder0
 15.25%   20.43%    test   %r11d,%r11d
                    je     SLOW_PATH              ; jump out if  ($isCoder0 == false)
  1.12%    1.12%    mov    0x10(%rsi),%r11d       ; get field $data
  5.42%    7.05%    mov    0xc(%r12,%r11,8),%eax  ; Coder0.work() body(), arraylength
 37.71%   32.57%    add    $0x20,%rsp             ; epilogue and return
  1.51%    1.67%    pop    %rbp
  0.10%             test   %eax,0x1814ecd6(%rip)
  5.71%    4.06%    retq

                  SLOW_PATH:
                    mov    %rsi,%rbp              ; handle the rare case
                    mov    %r11d,(%rsp)
                    mov    $0xffffff65,%esi
                    callq  0x00007f32910051a0     ; call into <uncommon trap>

Notice how similar it is to the monomorphic inlining! We even have the uncommon trap on the cold branch. The only difference from the dynamic_Abstract_Ref case is that we don’t mess with comparing the classword, which saves a few instructions. This explains the minuscule improvement.

static_Bool_ifElse and bias = 0.1 responds similarly to the changed bias:

                  [Verified Entry Point]
  1.07%    0.45%    mov    %eax,-0x14000(%rsp)
  3.04%    3.43%    push   %rbp
  0.03%    0.07%    sub    $0x10,%rsp
  1.36%    0.57%    mov    0x10(%rsi),%r11d       ; get field $data
 23.73%   31.22%    movzbl 0xc(%rsi),%r10d        ; get field $isCoder0
  4.25%    4.70%    test   %r10d,%r10d
                    je     CODER_1                ; jump out if ($isCoder0 == false)
  1.98%    2.53%    mov    0xc(%r12,%r11,8),%eax  ; Coder0.work() body, arraylength

                  EPILOG:
 21.30%   17.72%    add    $0x10,%rsp             ; epilogue and return
  1.10%    1.25%    pop    %rbp
  0.90%    0.47%    test   %eax,0x187a7416(%rip)
  5.87%    3.99%    retq

                  CODER_1:
  2.14%    2.03%    mov    0xc(%r12,%r11,8),%eax  ; Coder1.work() body, arraylength
  4.90%    4.51%    jmp    EPILOG                 ; jump back

static_Bool_ifElse and bias = 0.5 is even tighter:

                  [Verified Entry Point]
  0.14%    0.10%    mov    %eax,-0x14000(%rsp)
  2.93%    2.70%    push   %rbp
           0.02%    sub    $0x10,%rsp
  0.17%    0.19%    mov    0x10(%rsi),%r11d       ; get field $data
 24.07%   27.26%    movzbl 0xc(%rsi),%r10d        ; get field $isCoder0
  3.93%    3.69%    test   %r10d,%r10d
                    jne    CODER_0                ; jump out if ($isCoder0 == false)
  6.05%    5.85%    mov    0xc(%r12,%r11,8),%eax  ; Coder1.work() body, arraylength
 13.12%   16.53%    jmp    EPILOG                 ; jump to epilogue

                  CODER_0:
  5.95%    6.13%    mov    0xc(%r12,%r11,8),%eax  ; Coder0.work() body, arraylength

                  EPILOG:
 11.27%   15.97%    add    $0x10,%rsp             ; epilogue and return
  0.14%    0.21%    pop    %rbp
  3.09%    1.99%    test   %eax,0x1815b50f(%rip)
  3.10%    2.88%    retq

…​but this is the real cheat: instead of comparing the classword twice, we compare the flag once, and select either alternative. This explains the significant improvement over C2 bimorphic inlining. But even in this case, branch misprediction costs make the performance vary across different biases: bias = 0.5, obviously, has the largest chance of mispredicted branches.

4.2.2. C1: static_…​

Benchmark                  (bias)  (count)  Mode  Cnt    Score   Error  Units

# Reference
Two.dynamic_Abstract_Ref      0.0    10000  avgt   50   57.449 ± 2.096  us/op
Two.dynamic_Abstract_Ref      0.1    10000  avgt   50   74.185 ± 1.606  us/op
Two.dynamic_Abstract_Ref      0.5    10000  avgt   50  120.458 ± 0.397  us/op
Two.dynamic_Abstract_Ref      0.9    10000  avgt   50   75.714 ± 1.695  us/op
Two.dynamic_Abstract_Ref      1.0    10000  avgt   50   58.020 ± 1.606  us/op

# if (coder0) Coder0.staticWork(data) else Coder1.staticWork(data);
Two.static_Bool_ifElse        0.0    10000  avgt   50   41.409 ± 0.145  us/op
Two.static_Bool_ifElse        0.1    10000  avgt   50   52.084 ± 0.410  us/op
Two.static_Bool_ifElse        0.5    10000  avgt   50   82.190 ± 1.855  us/op
Two.static_Bool_ifElse        0.9    10000  avgt   50   52.306 ± 0.423  us/op
Two.static_Bool_ifElse        1.0    10000  avgt   50   45.208 ± 0.513  us/op

# if (id == 0) Coder0.staticWork(data) else if (id == 1) Coder1.staticWork(data)
Two.static_ID_ifElse          0.0    10000  avgt   50   40.777 ± 1.236  us/op
Two.static_ID_ifElse          0.1    10000  avgt   50   53.125 ± 2.200  us/op
Two.static_ID_ifElse          0.5    10000  avgt   50   83.833 ± 1.231  us/op
Two.static_ID_ifElse          0.9    10000  avgt   50   53.826 ± 0.490  us/op
Two.static_ID_ifElse          1.0    10000  avgt   50   47.484 ± 0.802  us/op

# switch (id) case 0: Coder0.staticWork(data); break; case 1: Coder1.staticWork(data); break;
Two.static_ID_switch          0.0    10000  avgt   50   45.338 ± 1.404  us/op
Two.static_ID_switch          0.1    10000  avgt   50   53.087 ± 0.193  us/op
Two.static_ID_switch          0.5    10000  avgt   50   83.418 ± 0.900  us/op
Two.static_ID_switch          0.9    10000  avgt   50   54.315 ± 1.435  us/op
Two.static_ID_switch          1.0    10000  avgt   50   46.568 ± 0.446  us/op

Remember that C1 is not able to properly inline either abstract or interface bimorphic calls. This is why our cheaty ways of dispatch are apparently winning. Let’s confirm with some disassembly.

Two.do_Static_Bool_ifElse, bias = 0.0:

                  [Verified Entry Point]
  3.20%    3.99%    mov    %eax,-0x14000(%rsp)
  2.34%    2.78%    push   %rbp
  2.23%    1.68%    sub    $0x30,%rsp
  1.29%    2.36%    movsbl 0xc(%rsi),%eax        ; get field $coder0
  4.58%    4.19%    cmp    $0x0,%eax
                    je     CODER_1               ; jump out if ($isCoder0 == false)
  2.59%    1.63%    mov    0x10(%rsi),%eax       ; get field $data
  1.91%    1.92%    shl    $0x3,%rax             ;   unpack it
  0.48%    0.91%    mov    0xc(%rax),%eax        ; Coder0.staticWork() body, arraylength
 36.12%   30.38%    add    $0x30,%rsp            ; epilogue and return
  1.07%    0.45%    pop    %rbp
  0.32%    0.86%    test   %eax,0x16b80b92(%rip)
  5.06%    3.71%    retq

                  CODER_1:
                    mov    0x10(%rsi),%eax
                    shl    $0x3,%rax             ; handle ($isCoder0 == false) case
                    mov    0xc(%rax),%eax        ; Coder1.staticWork() body, arraylength
                    ...

C1 does not emit the uncommon traps, and generates both branches. The frequency-based basic block layout still moved the most frequent code path higher. Note how both Coder0 and Coder1 calls are inlined.

do_Static_Bool_ifElse, bias = 0.5:

                  [Verified Entry Point]
  1.89%    2.80%    mov    %eax,-0x14000(%rsp)
  0.91%    0.65%    push   %rbp
  1.84%    1.99%    sub    $0x30,%rsp
  0.24%    0.30%    movsbl 0xc(%rsi),%eax        ; get field $coder0
  1.97%    1.56%    cmp    $0x0,%eax
                    je     CODER_1               ; jump out if ($isCoder0 == false)
  3.54%    4.32%    mov    0x10(%rsi),%eax       ; get field $data
  6.23%    7.63%    shl    $0x3,%rax             ;   unpack it
  0.56%    1.26%    mov    0xc(%rax),%eax        ; Coder0.staticWork() body
  9.48%   10.11%    add    $0x30,%rsp            ; epilogue and return
  0.09%    0.07%    pop    %rbp
  0.07%    0.06%    test   %eax,0x16982ed2(%rip)
  2.17%    1.82%    retq

                  CODER_1:
  3.99%    4.16%    mov    0x10(%rsi),%eax       ; get field $data
  5.14%    7.07%    shl    $0x3,%rax             ;   unpack it
  0.82%    1.32%    mov    0xc(%rax),%eax        ; Coder1.staticWork() body
 13.50%   13.58%    add    $0x30,%rsp            ; epilogue and return
  0.13%             pop    %rbp
           0.06%    test   %eax,0x16982ebc(%rip)
  2.50%    1.60%    retq

This code looks more aesthetically pleasing to my eyes: there is only a single test, and two complete branches. It would be even better if global code motion moved the getfield of $data before the branch, but again, the goal for C1 is to produce code fast.

4.2.3. Interpreter

Benchmark                  (bias)  (count)  Mode  Cnt    Score    Error  Units

# Reference
Two.dynamic_Abstract_Ref      0.0    10000  avgt   50  454.018 ±  1.001  us/op
Two.dynamic_Abstract_Ref      0.1    10000  avgt   50  459.251 ±  0.366  us/op
Two.dynamic_Abstract_Ref      0.5    10000  avgt   50  458.424 ±  2.020  us/op
Two.dynamic_Abstract_Ref      0.9    10000  avgt   50  464.937 ±  7.386  us/op
Two.dynamic_Abstract_Ref      1.0    10000  avgt   50  454.475 ±  1.075  us/op

# if (coder0) Coder0.staticWork(data) else Coder1.staticWork(data);
Two.static_Bool_ifElse        0.0    10000  avgt   50  432.259 ±  1.603  us/op
Two.static_Bool_ifElse        0.1    10000  avgt   50  445.540 ±  1.603  us/op
Two.static_Bool_ifElse        0.5    10000  avgt   50  490.248 ±  1.594  us/op
Two.static_Bool_ifElse        0.9    10000  avgt   50  460.986 ±  6.560  us/op
Two.static_Bool_ifElse        1.0    10000  avgt   50  444.312 ±  1.988  us/op

# if (id == 0) Coder0.staticWork(data) else if (id == 1) Coder1.staticWork(data) ...
Two.static_ID_ifElse          0.0    10000  avgt   50  435.689 ±  4.659  us/op
Two.static_ID_ifElse          0.1    10000  avgt   50  449.781 ±  2.840  us/op
Two.static_ID_ifElse          0.5    10000  avgt   50  507.018 ±  3.015  us/op
Two.static_ID_ifElse          0.9    10000  avgt   50  469.138 ±  0.913  us/op
Two.static_ID_ifElse          1.0    10000  avgt   50  461.163 ±  0.813  us/op

# switch (id) case 0: Coder0.staticWork(data); break; case 1: Coder1.staticWork(data); break;
Two.static_ID_switch          0.0    10000  avgt   50  474.470 ±  1.393  us/op
Two.static_ID_switch          0.1    10000  avgt   50  476.816 ±  1.699  us/op
Two.static_ID_switch          0.5    10000  avgt   50  506.091 ±  1.520  us/op
Two.static_ID_switch          0.9    10000  avgt   50  472.006 ±  4.305  us/op
Two.static_ID_switch          1.0    10000  avgt   50  465.332 ±  2.689  us/op

Argh, but this trick does not work for the interpreter. The reason is actually quite simple: the additional accesses to the static coder instances cost something.

4.2.4. Discussion II.2

As we predicted, cheating the VM by dispatching the work manually works:

  1. In the C2 case, we save a few bytes in the instruction stream by not comparing the classword, and generally produce denser code. It is cheating, since the compiler has to provide the capabilities for de-optimization, handling other types if any come in the future, etc., but our own code ignores all those concerns. That is, it is not the compiler’s fault that it produces less optimal code; it just has to be more generic.

  2. In the C1 case, we enable the inlining of target methods by using static targets. The dispatch code compiles into something similar to what the C2 version is doing. Therefore, the performance across C1 and C2 is more consistent here.

Once again, these optimizations should be employed only where the peak performance is needed. The code otherwise produced by C2 is rather nice.
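For reference, the manual dispatch patterns discussed above can be sketched as follows. This is a minimal, self-contained sketch: the real benchmark classes live in the linked source, and the Coder0/Coder1 bodies here are placeholder stand-ins that just return the array length, mirroring the arraylength operation seen in the disassembly.

```java
public class ManualDispatch {
    static final class Coder0 {
        static int staticWork(byte[] data) { return data.length; }
    }

    static final class Coder1 {
        static int staticWork(byte[] data) { return data.length; }
    }

    // static_Bool_ifElse: dispatch on a boolean flag
    static int byBool(boolean isCoder0, byte[] data) {
        return isCoder0 ? Coder0.staticWork(data) : Coder1.staticWork(data);
    }

    // static_ID_ifElse: dispatch on a numeric ID with if/else
    static int byId(byte id, byte[] data) {
        if (id == 0) return Coder0.staticWork(data);
        else if (id == 1) return Coder1.staticWork(data);
        else throw new IllegalStateException("Unknown coder: " + id);
    }

    // static_ID_switch: dispatch on a numeric ID with switch
    static int bySwitch(byte id, byte[] data) {
        switch (id) {
            case 0:  return Coder0.staticWork(data);
            case 1:  return Coder1.staticWork(data);
            default: throw new IllegalStateException("Unknown coder: " + id);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[10];
        System.out.println(byBool(true, data));
        System.out.println(byId((byte) 1, data));
        System.out.println(bySwitch((byte) 0, data));
    }
}
```

Note that all three variants resolve to static call targets at compile time; the only runtime work left is the branch itself, which is exactly what the disassembly showed.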

4.3. Cheated By The Runtime

All right, but what about combining the auxiliary selector with the dynamic cases? Surely we will set ourselves up for better inlining decisions if we do dynamic calls off the static final constants? This is where it starts to get funny. Let’s take C2 again, and run one of the dynamic_…​ tests with a trick:

private AbstractCoder abstract_Bool_IfElse() {
    return (isCoder0 ? coder0 : coder1);
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Dynamic_Abstract_Bool_ifElse() {
    return abstract_Bool_IfElse().abstractWork(data);
}
Benchmark                         (bias)  (count)  Mode  Cnt    Score   Error  Units
Two.dynamic_Abstract_Ref             0.0    10000  avgt   50   54.419 ± 1.733  us/op
Two.dynamic_Abstract_Ref             0.1    10000  avgt   50   58.244 ± 0.652  us/op
Two.dynamic_Abstract_Ref             0.5    10000  avgt   50   95.765 ± 0.180  us/op
Two.dynamic_Abstract_Ref             0.9    10000  avgt   50   58.618 ± 0.843  us/op
Two.dynamic_Abstract_Ref             1.0    10000  avgt   50   52.412 ± 0.185  us/op

Two.dynamic_Abstract_Bool_ifElse     0.0    10000  avgt   50   50.240 ± 3.273  us/op
Two.dynamic_Abstract_Bool_ifElse     0.1    10000  avgt   50   61.233 ± 0.257  us/op
Two.dynamic_Abstract_Bool_ifElse     0.5    10000  avgt   50  100.861 ± 0.792  us/op
Two.dynamic_Abstract_Bool_ifElse     0.9    10000  avgt   50   60.693 ± 0.119  us/op
Two.dynamic_Abstract_Bool_ifElse     1.0    10000  avgt   50   54.018 ± 1.370  us/op

With bias=0.0, the generated code looks very optimized.

                  [Verified Entry Point]
  1.41%    0.97%    mov    %eax,-0x14000(%rsp)
  3.47%    2.98%    push   %rbp
  0.02%    0.02%    sub    $0x20,%rsp
  1.31%    0.77%    movzbl 0xc(%rsi),%r11d        ; get field $isCoder0
 17.88%   24.63%    test   %r11d,%r11d
                    je     SLOW_PATH              ; jump out if ($isCoder0 == false)
  1.58%    1.72%    mov    0x10(%rsi),%r11d       ; get field $data
  3.93%    4.99%    mov    0xc(%r12,%r11,8),%eax  ; Coder0.abstractWork() body, arraylength
 32.66%   27.46%    add    $0x20,%rsp             ; epilog and return
  1.43%    1.60%    pop    %rbp
  0.02%             test   %eax,0x17cdcad6(%rip)
  6.41%    5.08%    retq

                  SLOW_PATH:
                    mov    %rsi,%rbp
                    mov    %r11d,(%rsp)
                    mov    $0xffffff65,%esi
                    callq  0x00007f92150051a0     ; <uncommon trap>

In fact, it is almost the same code that static_Bool_ifElse is generating! Dumb performance tests like this will delude you into believing this trick costs you nothing when a sufficiently smart compiler is around. It becomes obvious you shouldn’t rely on this when you try a non-trivial bias, like bias=0.5:

                  [Verified Entry Point]
  0.37%    0.77%    mov    %eax,-0x14000(%rsp)
  1.84%    2.44%    push   %rbp
  0.13%    0.12%    sub    $0x20,%rsp
  0.40%    0.77%    mov    0x10(%rsi),%r11d       ; get field $data
 16.73%   16.54%    movzbl 0xc(%rsi),%r10d        ; get field $isCoder0
  2.04%    2.25%    mov    $0x71943eae0,%r8       ; get static field Data.coder1
  0.13%    0.25%    mov    $0x71943ead0,%r9       ; get static field Data.coder0
  0.10%    0.22%    test   %r10d,%r10d            ; select either field based on $isCoder
  2.01%    1.79%    cmovne %r9,%r8
  2.24%    1.77%    mov    0x8(%r8),%r10d         ; load coder.<classword>
  8.38%    9.45%    cmp    $0xf801040a,%r10d      ; if coder is Coder0
                    je     CODER_0                ;   jump over
  5.65%    9.22%    cmp    $0xf8012585,%r10d      ; if coder is Coder1
                    jne    SLOW_PATH              ;   not? jump to slow path
  3.39%    3.74%    mov    0xc(%r12,%r11,8),%eax  ; Coder1.abstractWork() body, arraylength
 10.31%    9.37%    jmp    EPILOG                 ; jump to epilog

                  CODER_0:
  5.18%    8.45%    mov    0xc(%r12,%r11,8),%eax  ; Coder0.abstractWork() body, arraylength

                  EPILOG:
 16.63%   15.14%    add    $0x20,%rsp             ; epilog and return
  0.17%    0.27%    pop    %rbp
  1.92%    0.47%    test   %eax,0x188e7ce3(%rip)
  2.64%    2.05%    retq

                  SLOW_PATH:
                    mov    $0xffffffc6,%esi
                    mov    %r8,%rbp
                    mov    %r11d,(%rsp)
                    nop
                    callq  0x00007f3dc10051a0     ; <uncommon trap>

See what happened? We do double work: first, we select the static field based on the flag, as our Java code does; then, we do the bimorphic inlining based on the actual type. This happens despite the fact that we could infer the actual type from the static final constant itself. The trick is not working, and it actually degrades performance compared to the dynamic_Ref case.

4.4. Emulating The Inlining

An impatient reader may wonder what happens if we emulate the type checks with explicit instanceof checks. We can easily add a few more tests that dispatch like that, for example:

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public int do_Static_Interface_Ref_ifElse() {
    if (coder instanceof Coder0) {
        return Coder0.staticWork(data);
    } else if (coder instanceof Coder1) {
        return Coder1.staticWork(data);
    } else {
        throw new IllegalStateException();
    }
}

Of course, we don’t know beforehand if the performance is affected by instanceof-ing against an abstract class or an interface instance, and therefore we need to check both. Similarly, we don’t know if calling the dynamic method after the instanceof check will be nicely optimized, so we have to try calling the static method as well.
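The two flavors can be sketched like this (a minimal sketch with assumed names mirroring the benchmarks; the method bodies are placeholder stand-ins): after the instanceof test, either keep the virtual call (dynamic_…​ cases), or call the static method of the matched class (static_…​ cases).

```java
public class InstanceofDispatch {
    interface Coder { int work(byte[] data); }

    static final class Coder0 implements Coder {
        public int work(byte[] data) { return staticWork(data); }
        static int staticWork(byte[] data) { return data.length; }
    }

    static final class Coder1 implements Coder {
        public int work(byte[] data) { return staticWork(data); }
        static int staticWork(byte[] data) { return data.length; }
    }

    // dynamic_Interface_Ref_ifElse flavor: instanceof selects the branch,
    // but the call itself is still a virtual (interface) call
    static int dynamicIfElse(Coder coder, byte[] data) {
        if (coder instanceof Coder0) {
            return coder.work(data);
        } else if (coder instanceof Coder1) {
            return coder.work(data);
        } else {
            throw new IllegalStateException();
        }
    }

    // static_Interface_Ref_ifElse flavor: instanceof selects the branch,
    // and the call target is statically known
    static int staticIfElse(Coder coder, byte[] data) {
        if (coder instanceof Coder0) {
            return Coder0.staticWork(data);
        } else if (coder instanceof Coder1) {
            return Coder1.staticWork(data);
        } else {
            throw new IllegalStateException();
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[10];
        System.out.println(dynamicIfElse(new Coder1(), data));
        System.out.println(staticIfElse(new Coder0(), data));
    }
}
```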

4.4.1. C2

Benchmark                         (bias)  (count)  Mode  Cnt    Score   Error  Units

; abstractCoder.abstractWork(data)
Two.dynamic_Abstract_Ref             0.0    10000  avgt   50   54.419 ± 1.733  us/op
Two.dynamic_Abstract_Ref             0.1    10000  avgt   50   58.244 ± 0.652  us/op
Two.dynamic_Abstract_Ref             0.5    10000  avgt   50   95.765 ± 0.180  us/op
Two.dynamic_Abstract_Ref             0.9    10000  avgt   50   58.618 ± 0.843  us/op
Two.dynamic_Abstract_Ref             1.0    10000  avgt   50   52.412 ± 0.185  us/op

; if (abstractCoder instanceof Coder0) abstractCoder.abstractWork(data) ...
Two.dynamic_Abstract_Ref_ifElse      0.0    10000  avgt   50   54.275 ± 0.209  us/op
Two.dynamic_Abstract_Ref_ifElse      0.1    10000  avgt   50   62.353 ± 1.610  us/op
Two.dynamic_Abstract_Ref_ifElse      0.5    10000  avgt   50   95.331 ± 0.252  us/op
Two.dynamic_Abstract_Ref_ifElse      0.9    10000  avgt   50   62.308 ± 0.579  us/op
Two.dynamic_Abstract_Ref_ifElse      1.0    10000  avgt   50   55.500 ± 0.423  us/op

; if (coder instanceof Coder0) coder.work(data) ...
Two.dynamic_Interface_Ref_ifElse     0.0    10000  avgt   50   52.735 ± 0.703  us/op
Two.dynamic_Interface_Ref_ifElse     0.1    10000  avgt   50   62.417 ± 0.625  us/op
Two.dynamic_Interface_Ref_ifElse     0.5    10000  avgt   50   95.351 ± 0.492  us/op
Two.dynamic_Interface_Ref_ifElse     0.9    10000  avgt   50   63.060 ± 0.774  us/op
Two.dynamic_Interface_Ref_ifElse     1.0    10000  avgt   50   55.465 ± 0.811  us/op

; if (coder instanceof Coder0) Coder0.staticWork(data) ...
Two.static_Interface_Ref_ifElse      0.0    10000  avgt   50   50.531 ± 2.636  us/op
Two.static_Interface_Ref_ifElse      0.1    10000  avgt   50   62.725 ± 1.784  us/op
Two.static_Interface_Ref_ifElse      0.5    10000  avgt   50   95.554 ± 0.481  us/op
Two.static_Interface_Ref_ifElse      0.9    10000  avgt   50   61.832 ± 0.644  us/op
Two.static_Interface_Ref_ifElse      1.0    10000  avgt   50   53.918 ± 0.730  us/op

; if (abstractCoder instanceof Coder0) Coder0.staticWork(data) ...
Two.static_Abstract_Ref_ifElse       0.0    10000  avgt   50   55.039 ± 0.425  us/op
Two.static_Abstract_Ref_ifElse       0.1    10000  avgt   50   57.879 ± 0.892  us/op
Two.static_Abstract_Ref_ifElse       0.5    10000  avgt   50   95.014 ± 0.660  us/op
Two.static_Abstract_Ref_ifElse       0.9    10000  avgt   50   57.665 ± 0.795  us/op
Two.static_Abstract_Ref_ifElse       1.0    10000  avgt   50   49.874 ± 0.183  us/op

These results seem to imply there is no significant difference either way. It will be enlightening to disassemble a few cases to understand how they are compiled.

dynamic_Interface_Ref_ifElse with bias=0.1:

                  [Verified Entry Point]
           0.01%    mov    %eax,-0x14000(%rsp)
  4.22%    3.71%    push   %rbp
  0.01%             sub    $0x20,%rsp
  0.05%    0.21%    mov    0x14(%rsi),%r11d        ; get field $coder
 15.77%   20.67%    mov    0x8(%r12,%r11,8),%r10d  ; load $coder.<classword>
  7.16%    7.92%    mov    0x10(%rsi),%r9d         ; get field $data
  0.17%    0.08%    cmp    $0xf8010422,%r10d       ; check $coder is Coder0
                    jne    CODER_1                 ;    jump further if not
  1.27%    1.59%    mov    0xc(%r12,%r9,8),%eax    ; Coder0.work() body, arraylength

                  EPILOG:
 32.13%   29.84%    add    $0x20,%rsp              ; epilogue and return
                    pop    %rbp
  0.55%    0.24%    test   %eax,0x18a10f8e(%rip)
  4.84%    3.66%    retq

                  CODER_1:
  1.73%    1.65%    cmp    $0xf8012585,%r10d       ; check if $coder is Coder1
                    jne    SLOW_PATH               ;   jump to slowpath if not
  1.28%    1.64%    mov    0xc(%r12,%r9,8),%eax    ; Coder1.work() body, arraylength
  3.59%    3.13%    jmp    EPILOG                  ; jump back

                  SLOW_PATH:
                    <omitted>

static_Abstract_Ref_ifElse with bias=0.1:

                  [Verified Entry Point]
  0.01%             mov    %eax,-0x14000(%rsp)
  4.23%    3.68%    push   %rbp
           0.01%    sub    $0x20,%rsp
  0.04%    0.08%    mov    0x18(%rsi),%r11d        ; get field $coder
 18.17%   23.09%    mov    0x8(%r12,%r11,8),%r10d  ; load $coder.<classword>
 13.54%   16.20%    mov    0x10(%rsi),%r9d         ; get field $data
  1.18%    1.34%    cmp    $0xf8010422,%r10d       ; check $coder is Coder0
                    jne    CODER_1                 ;    jump further if not
  2.08%    2.68%    mov    0xc(%r12,%r9,8),%eax    ; Coder0.work() body, arraylength

                  EPILOG:
 19.26%   15.89%    add    $0x20,%rsp              ; epilogue and return
                    pop    %rbp
  0.66%    0.41%    test   %eax,0x18b55c8e(%rip)
  4.60%    3.56%    retq

                  CODER_1:
  2.54%    2.16%    cmp    $0xf8012585,%r10d       ; check if $coder is Coder1
                    jne    SLOW_PATH               ;   jump to slowpath if not
  1.67%    1.79%    mov    0xc(%r12,%r9,8),%eax    ; Coder1.work() body, arraylength
  3.71%    2.54%    jmp    EPILOG                  ; jump back

                  SLOW_PATH:
                    <omitted>

Not only are these pieces of code exactly the same, they are also equivalent to letting C2 do the bimorphic inlining on its own! Let us remember this low-level trick, as we will reuse it later when we deal with several types.

This is possible since instanceof is routinely handled by compilers, and is parsed into the same typecheck the compiler would emit itself. It is not surprising, then, to see the VM generate exactly the same code.

4.4.2. C1

Benchmark                         (bias)  (count)  Mode  Cnt    Score   Error  Units

; abstractCoder.abstractWork(data)
Two.dynamic_Abstract_Ref             0.0    10000  avgt   50   57.449 ± 2.096  us/op
Two.dynamic_Abstract_Ref             0.1    10000  avgt   50   74.185 ± 1.606  us/op
Two.dynamic_Abstract_Ref             0.5    10000  avgt   50  120.458 ± 0.397  us/op
Two.dynamic_Abstract_Ref             0.9    10000  avgt   50   75.714 ± 1.695  us/op
Two.dynamic_Abstract_Ref             1.0    10000  avgt   50   58.020 ± 1.606  us/op

; if (abstractCoder instanceof Coder0) abstractCoder.abstractWork(data) ...
Two.dynamic_Abstract_Ref_ifElse      0.0    10000  avgt   50   67.736 ± 0.628  us/op
Two.dynamic_Abstract_Ref_ifElse      0.1    10000  avgt   50   78.921 ± 0.484  us/op
Two.dynamic_Abstract_Ref_ifElse      0.5    10000  avgt   50  127.833 ± 0.418  us/op
Two.dynamic_Abstract_Ref_ifElse      0.9    10000  avgt   50   87.741 ± 0.829  us/op
Two.dynamic_Abstract_Ref_ifElse      1.0    10000  avgt   50   77.055 ± 1.126  us/op

; if (coder instanceof Coder0) coder.work(data) ...
Two.dynamic_Interface_Ref_ifElse     0.0    10000  avgt   50   67.132 ± 0.557  us/op
Two.dynamic_Interface_Ref_ifElse     0.1    10000  avgt   50   81.280 ± 1.099  us/op
Two.dynamic_Interface_Ref_ifElse     0.5    10000  avgt   50  127.587 ± 0.195  us/op
Two.dynamic_Interface_Ref_ifElse     0.9    10000  avgt   50   87.288 ± 0.809  us/op
Two.dynamic_Interface_Ref_ifElse     1.0    10000  avgt   50   75.220 ± 1.172  us/op

; if (coder instanceof Coder0) Coder0.staticWork(data) ...
Two.static_Interface_Ref_ifElse      0.0    10000  avgt   50   51.127 ± 0.897  us/op
Two.static_Interface_Ref_ifElse      0.1    10000  avgt   50   60.506 ± 2.378  us/op
Two.static_Interface_Ref_ifElse      0.5    10000  avgt   50  103.250 ± 0.207  us/op
Two.static_Interface_Ref_ifElse      0.9    10000  avgt   50   69.919 ± 3.465  us/op
Two.static_Interface_Ref_ifElse      1.0    10000  avgt   50   59.107 ± 2.813  us/op

; if (abstractCoder instanceof Coder0) Coder0.staticWork(data) ...
Two.static_Abstract_Ref_ifElse       0.0    10000  avgt   50   50.748 ± 0.797  us/op
Two.static_Abstract_Ref_ifElse       0.1    10000  avgt   50   61.218 ± 0.320  us/op
Two.static_Abstract_Ref_ifElse       0.5    10000  avgt   50  103.684 ± 0.446  us/op
Two.static_Abstract_Ref_ifElse       0.9    10000  avgt   50   69.081 ± 0.104  us/op
Two.static_Abstract_Ref_ifElse       1.0    10000  avgt   50   58.921 ± 0.335  us/op

C1 responds to this hack less gracefully. The dynamic cases are generally slower, because there is still a virtual/interface call involved that C1 is not able to inline. The static cases are generally faster, because they avoid doing virtual/interface calls. Still, the static cases are not on par with clean "by ID" selectors, since the generated code is much less optimized. Interested readers are invited to disassemble and study it.

4.5. Discussion II

Looking at the case of two coders, we can spot a few things:

  1. The usual coding practice of separating distinct behaviors into distinct (sub)classes or interface implementations is nicely optimized by C2. C1 has some problems with efficiently inlining the bimorphic cases, but thankfully users will only observe this behavior transiently, until C2 kicks in.

  2. In some cases, when you know the call targets exactly, it might be a good idea to dispatch statically across a few implementations, in order to avoid dealing with compiler optimizations that still leave some overhead in the generated code. Again, doing so will probably yield a non-generic solution, and that’s one of the reasons it could be faster. The compilers have to provide generic implementations, and hence can help only so much.

  3. The performance costs we observed, especially in an optimized case, are related to hardware branch (mis)predictions. Even the most optimized code still ought to have branches. It may look like there is no reason to inline the virtual calls then, but that’s a shortsighted view. The inlining actually broadens the scope of other optimizations, and that alone is, in many cases, enough reason to inline. With very thin nanobenchmarks, that advantage cannot be properly quantified.

5. Three Types And Beyond

With three coder types, the coder distribution makes for an even more challenging task. The easiest solution would seem to be running with different "parts" per coder type.

@Param("1")
private int p1;

@Param("1")
private int p2;

@Param("1")
private int p3;

private Data[] datas;

@Setup
public void setup() {
    Random r = new Random(12345);

    int s1 = (count * p1) / (p1 + p2 + p3);
    int s2 = (count * (p1 + p2)) / (p1 + p2 + p3);

    List<Data> ts = new ArrayList<Data>();
    for (int c = 0; c < count; c++) {
        byte[] contents = new byte[10];
        r.nextBytes(contents);

        byte id;
        if (c < s1) {
            id = 0;
        } else if (c < s2) {
            id = 1;
        } else {
            id = 2;
        }
        ts.add(new Data(id, contents));
    }
    Collections.shuffle(ts, r);
    datas = ts.toArray(new Data[0]);
}

Also, we need to extend the selectors for three types. Here is the source code.
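The extended selectors follow the same shape as the two-type static_ID_…​ benchmarks; a minimal, assumed sketch (the real code is in the linked source, and the Coder bodies here are placeholders returning the array length) looks like this:

```java
public class ThreeWayDispatch {
    static final class Coder0 { static int staticWork(byte[] d) { return d.length; } }
    static final class Coder1 { static int staticWork(byte[] d) { return d.length; } }
    static final class Coder2 { static int staticWork(byte[] d) { return d.length; } }

    // static_ID_ifElse, extended to three types
    static int staticIdIfElse(byte id, byte[] data) {
        if (id == 0) return Coder0.staticWork(data);
        else if (id == 1) return Coder1.staticWork(data);
        else if (id == 2) return Coder2.staticWork(data);
        else throw new IllegalStateException("Unknown coder: " + id);
    }

    // static_ID_switch, extended to three types
    static int staticIdSwitch(byte id, byte[] data) {
        switch (id) {
            case 0:  return Coder0.staticWork(data);
            case 1:  return Coder1.staticWork(data);
            case 2:  return Coder2.staticWork(data);
            default: throw new IllegalStateException("Unknown coder: " + id);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[10];
        for (byte id = 0; id < 3; id++) {
            System.out.println(staticIdIfElse(id, data) + " " + staticIdSwitch(id, data));
        }
    }
}
```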

5.1. Monomorphic cases

Remember from the previous parts that compilers make their inlining decisions based on the profile. When there is only a single observed type per call site (i.e. the call site is "monomorphic"), or two observed types per call site (i.e. the call site is "bimorphic"), at least C2 is able to optimize these cases. Before going further, it makes sense to verify this still holds true for our three-type benchmarks. This way we confirm that our knowledge from the previous part carries over to this test.

5.1.1. C2

Benchmark                    (count)  (p1)  (p2)  (p3)  Mode  Cnt    Score   Error  Units
Three.dynamic_Abstract_Ref     10000     1     0     0  avgt   50   52.384 ± 0.653  us/op
Three.dynamic_Interface_Ref    10000     1     0     0  avgt   50   51.807 ± 0.453  us/op
Three.static_ID_ifElse         10000     1     0     0  avgt   50   48.162 ± 0.182  us/op
Three.static_ID_switch         10000     1     0     0  avgt   50   48.597 ± 0.539  us/op

Three.dynamic_Abstract_Ref     10000     0     1     0  avgt   50   51.663 ± 0.159  us/op
Three.dynamic_Interface_Ref    10000     0     1     0  avgt   50   51.784 ± 0.824  us/op
Three.static_ID_ifElse         10000     0     1     0  avgt   50   48.308 ± 0.855  us/op
Three.static_ID_switch         10000     0     1     0  avgt   50   48.514 ± 0.148  us/op

Three.dynamic_Abstract_Ref     10000     0     0     1  avgt   50   52.356 ± 0.541  us/op
Three.dynamic_Interface_Ref    10000     0     0     1  avgt   50   52.874 ± 0.644  us/op
Three.static_ID_ifElse         10000     0     0     1  avgt   50   48.349 ± 0.200  us/op
Three.static_ID_switch         10000     0     0     1  avgt   50   48.651 ± 0.052  us/op

Everything adds up so far. We covered why the static cases are slightly faster than dynamic cases before.

It is interesting to note that even though the …​ifElse cases produce a static layout, whose performance should theoretically be affected by which coder dominates, it is not affected in practice. For example, static_ID_ifElse with p1/p2/p3 = 0/0/1 looks like this:

                  [Verified Entry Point]
  0.06%    0.06%    mov    %eax,-0x14000(%rsp)
  4.44%    3.30%    push   %rbp
  0.06%    0.10%    sub    $0x20,%rsp
  0.02%    0.03%    movsbl 0xc(%rsi),%r11d        ; get field $id
 12.10%   16.93%    test   %r11d,%r11d
                    je     ID_0                   ; jump if ($id == 0)
  0.65%    0.66%    cmp    $0x1,%r11d
                    je     ID_1                   ; jump if ($id == 1)
  0.75%    0.92%    cmp    $0x2,%r11d
                    jne    FAIL                   ; jump if ($id != 2)
  0.77%    0.85%    mov    0x10(%rsi),%r11d       ; get field $data
  4.15%    2.73%    mov    0xc(%r12,%r11,8),%eax  ; Coder2.staticWork() body, arraylength
 45.11%   44.93%    add    $0x20,%rsp             ; epilogue and return
  0.03%             pop    %rbp
  0.03%    0.06%    test   %eax,0x170e758a(%rip)
  6.48%    5.66%    retq

                  FAIL:
                    <omitted>

                  ID_0:
                    <omitted>

                  ID_1:
                    <omitted>

Even though we have to go through two redundant branches, we still maintain almost the same performance. All branches seem to be nicely predicted.

5.1.2. C1

Benchmark                    (count)  (p1)  (p2)  (p3)  Mode  Cnt    Score   Error  Units
Three.dynamic_Abstract_Ref     10000     1     0     0  avgt   50   56.038 ± 2.058  us/op
Three.dynamic_Interface_Ref    10000     1     0     0  avgt   50   55.953 ± 1.388  us/op
Three.static_ID_ifElse         10000     1     0     0  avgt   50   45.050 ± 1.048  us/op
Three.static_ID_switch         10000     1     0     0  avgt   50   45.899 ± 0.199  us/op

Three.dynamic_Abstract_Ref     10000     0     1     0  avgt   50   56.541 ± 0.562  us/op
Three.dynamic_Interface_Ref    10000     0     1     0  avgt   50   55.355 ± 0.182  us/op
Three.static_ID_ifElse         10000     0     1     0  avgt   50   47.162 ± 0.469  us/op
Three.static_ID_switch         10000     0     1     0  avgt   50   46.968 ± 0.198  us/op

Three.dynamic_Abstract_Ref     10000     0     0     1  avgt   50   55.299 ± 0.118  us/op
Three.dynamic_Interface_Ref    10000     0     0     1  avgt   50   56.877 ± 0.469  us/op
Three.static_ID_ifElse         10000     0     0     1  avgt   50   46.654 ± 0.871  us/op
Three.static_ID_switch         10000     0     0     1  avgt   50   47.972 ± 0.198  us/op

The same picture is painted in the C1 case. We already know why the static cases are slightly faster than the dynamic cases; see the single- and two-type cases. Here, however, the position of the matching branch in the chain seems to matter a bit.

  2.37%    3.29%    mov    %eax,-0x14000(%rsp)
  3.12%    2.04%    push   %rbp
  2.12%    2.18%    sub    $0x40,%rsp
  1.46%    0.72%    movsbl 0xc(%rsi),%eax        ; get field $id
  3.43%    2.96%    cmp    $0x0,%eax
                    je     ID_0                  ; jump if ($id == 0)
  2.16%    2.17%    cmp    $0x1,%eax
                    je     ID_1                  ; jump if ($id == 1)
  0.80%    0.89%    cmp    $0x2,%eax
                    jne    FAIL                  ; jump if ($id != 2)
  1.75%    1.12%    mov    0x10(%rsi),%eax       ; get field $data
  3.60%    3.56%    shl    $0x3,%rax             ;   unpack
  1.85%    2.10%    mov    0xc(%rax),%eax        ; Coder2.staticWork() body, arraylength
 29.12%   24.91%    add    $0x40,%rsp            ; epilogue and return
  0.71%    1.14%    pop    %rbp
  1.44%    1.22%    test   %eax,0x18217420(%rip)
  5.43%    5.24%    retq

                  FAIL:
                    <omitted>

                  ID_0:
                    <omitted>

                  ID_1:
                    <omitted>

The instruction selection is a bit different, as one could expect from different compilers, which may explain the performance difference between the compiled code versions, and why the C1 version reacts to how many branches the code takes.

My own speculation is this: cmp $0x0,%eax is actually slightly slower than test %eax,%eax. These tiny micro-architectural differences are sometimes what nano-benchmarks are all about.

5.2. Bimorphic cases

5.2.1. C2

Benchmark                    (count)  (p1)  (p2)  (p3)  Mode  Cnt    Score   Error  Units
Three.dynamic_Abstract_Ref     10000     0     1     1  avgt   50   95.986 ± 0.835  us/op
Three.dynamic_Interface_Ref    10000     0     1     1  avgt   50   97.242 ± 2.292  us/op
Three.static_ID_ifElse         10000     0     1     1  avgt   50   89.957 ± 0.164  us/op
Three.static_ID_switch         10000     0     1     1  avgt   50   92.412 ± 2.351  us/op

Three.dynamic_Abstract_Ref     10000     1     0     1  avgt   50   97.081 ± 1.534  us/op
Three.dynamic_Interface_Ref    10000     1     0     1  avgt   50   97.001 ± 2.200  us/op
Three.static_ID_ifElse         10000     1     0     1  avgt   50   89.764 ± 1.230  us/op
Three.static_ID_switch         10000     1     0     1  avgt   50   92.158 ± 1.512  us/op

Three.dynamic_Abstract_Ref     10000     1     1     0  avgt   50   96.243 ± 1.528  us/op
Three.dynamic_Interface_Ref    10000     1     1     0  avgt   50   95.130 ± 0.234  us/op
Three.static_ID_ifElse         10000     1     1     0  avgt   50   88.195 ± 0.296  us/op
Three.static_ID_switch         10000     1     1     0  avgt   50   89.794 ± 0.124  us/op

The same thing applies to the bimorphic case. We already know that C2 is able to do the bimorphic inlining. We also know we can cheat and save a few cycles by doing the dispatch on our own.

5.2.2. C1

Benchmark                    (count)  (p1)  (p2)  (p3)  Mode  Cnt    Score   Error  Units
Three.dynamic_Abstract_Ref     10000     0     1     1  avgt   50  119.439 ± 0.371  us/op
Three.dynamic_Interface_Ref    10000     0     1     1  avgt   50  136.081 ± 0.605  us/op
Three.static_ID_ifElse         10000     0     1     1  avgt   50   84.079 ± 0.218  us/op
Three.static_ID_switch         10000     0     1     1  avgt   50   84.402 ± 0.860  us/op

Three.dynamic_Abstract_Ref     10000     1     0     1  avgt   50  120.241 ± 0.617  us/op
Three.dynamic_Interface_Ref    10000     1     0     1  avgt   50  136.434 ± 0.673  us/op
Three.static_ID_ifElse         10000     1     0     1  avgt   50   82.849 ± 0.295  us/op
Three.static_ID_switch         10000     1     0     1  avgt   50   83.061 ± 0.401  us/op

Three.dynamic_Abstract_Ref     10000     1     1     0  avgt   50  119.770 ± 0.216  us/op
Three.dynamic_Interface_Ref    10000     1     1     0  avgt   50  137.624 ± 3.552  us/op
Three.static_ID_ifElse         10000     1     1     0  avgt   50   85.120 ± 0.147  us/op
Three.static_ID_switch         10000     1     1     0  avgt   50   84.840 ± 0.218  us/op

Again, nothing new here. We already know that C1 is not able to do proper bimorphic inlining, and we are stuck with non-optimal code. Our custom dispatch enables the inlining, and avoids the VM dispatch.

5.3. Megamorphic cases

Now, when there are more than two types, we are stepping into something new. To make matters interesting, we measure in three configurations: types evenly distributed (1/1/1), the first type taking 90% of all instances (18/1/1), and the first type taking 95% of all instances (38/1/1).

5.3.1. C2

Benchmark                    (count)  (p1)  (p2)  (p3)  Mode  Cnt    Score   Error  Units
Three.dynamic_Abstract_Ref     10000     1     1     1  avgt   50  138.611 ± 2.524  us/op
Three.dynamic_Interface_Ref    10000     1     1     1  avgt   50  155.505 ± 3.027  us/op
Three.static_ID_ifElse         10000     1     1     1  avgt   50   99.868 ± 2.026  us/op
Three.static_ID_switch         10000     1     1     1  avgt   50   98.905 ± 0.600  us/op

Three.dynamic_Abstract_Ref     10000    18     1     1  avgt   50   79.674 ± 2.423  us/op
Three.dynamic_Interface_Ref    10000    18     1     1  avgt   50   90.016 ± 0.940  us/op
Three.static_ID_ifElse         10000    18     1     1  avgt   50   57.578 ± 1.079  us/op
Three.static_ID_switch         10000    18     1     1  avgt   50   55.561 ± 0.169  us/op

Three.dynamic_Abstract_Ref     10000    38     1     1  avgt   50   58.465 ± 0.160  us/op
Three.dynamic_Interface_Ref    10000    38     1     1  avgt   50   58.335 ± 0.209  us/op
Three.static_ID_ifElse         10000    38     1     1  avgt   50   51.692 ± 1.070  us/op
Three.static_ID_switch         10000    38     1     1  avgt   50   51.860 ± 0.156  us/op

What can we see here?

  1. static cases get better when the runs are biased towards one type. This is due to better branch prediction. The difference provides an estimate of what one should expect from an optimized virtual call. (We additionally verified that exactly the same result is produced when we bias the distribution towards any other type, not only the first.)

  2. When the types are evenly distributed, we get a severe performance hit in dynamic_…​ cases. This is because HotSpot thinks the call site now has too many receiver types; in other words, the call site is megamorphic. Current C2 does not do megamorphic inlining at all.

  3. …​with the exception of one case, when there is a clear winner claiming >90% of the type profile.

Why did we try 38/1/1? This distribution allocates 95% of all targets to a single coder, and we know that the C2 implementation sets the threshold for this optimistic inlining via TypeProfileMajorReceiverPercent at 90%.
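For experiments, this threshold can be inspected and adjusted with regular HotSpot flags. A sketch of such a run — the jar name here is a placeholder, not an actual artifact from this post:

```shell
# Inspect the current threshold (the text above says it defaults to 90):
java -XX:+PrintFlagsFinal -version | grep TypeProfileMajorReceiverPercent

# Re-run with a lowered threshold, so that an 18/1/1 (90%) winner would
# also qualify for optimistic inlining; benchmarks.jar is a placeholder:
java -XX:TypeProfileMajorReceiverPercent=85 -jar benchmarks.jar
```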

Following this up, disassembling dynamic_Abstract_Ref at 18/1/1 yields:

                 [Verified Entry Point]
  0.52%    0.53%    mov    %eax,-0x14000(%rsp)
  2.50%    2.89%    push   %rbp
                    sub    $0x10,%rsp
  2.37%    1.20%    mov    0x10(%rsi),%r10d   ; get field $data
  7.67%    9.87%    mov    0x18(%rsi),%r8d    ; get field $abstractCoder
  1.70%    1.88%    mov    %r10,%rdx
  0.37%    0.24%    shl    $0x3,%rdx
  1.33%    0.60%    mov    %r8,%rsi
  0.63%    1.29%    shl    $0x3,%rsi
  0.51%    0.63%    xchg   %ax,%ax
  0.43%    0.26%    mov    $0xffffffffffffffff,%rax
  1.46%    0.48%    callq  0x00007f4bd1046220  ; VIRTUAL CALL TO abstractWork()
  2.96%    0.38%    add    $0x10,%rsp
  0.79%    0.94%    pop    %rbp
  1.44%    1.22%    test   %eax,0x16921d01(%rip)
  0.86%    0.12%    retq

…​and all Coders not inlined in the profile:

....[Hottest Methods (after inlining)]..............................................................
 34.02%   37.66%  net.shipilev.three.Coder0::abstractWork
 25.20%   22.80%  net.shipilev.three.Data::do_Dynamic_Abstract_Ref
 14.79%   15.94%  net.shipilev.three.generated.Three_dynamic_Abstract_Ref::dynamic_Abstract_Ref_avgt_jmhStub
 13.60%   13.36%  java.util.HashMap::hash
  4.54%    3.35%  net.shipilev.three.Coder2::abstractWork
  4.05%    2.98%  net.shipilev.three.Coder1::abstractWork

This is similar to what we saw in C1 case.

Disassembling dynamic_Abstract_Ref at 38/1/1 shows that Coder0.abstractWork() got inlined!

                  [Verified Entry Point]
                    mov    %eax,-0x14000(%rsp)
  3.98%    3.41%    push   %rbp
  0.01%             sub    $0x10,%rsp
                    mov    0x10(%rsi),%r11d        ; get field $data
 10.59%   14.23%    mov    0x18(%rsi),%r10d        ; get field $abstractCoder
  1.96%    2.51%    mov    0x8(%r12,%r10,8),%r9d   ; load $abstractCoder.<classword>
  3.54%    4.89%    cmp    $0xf80103f6,%r9d        ;
                    jne    SLOW_PATH               ; jump if $abstractCoder is not Coder0
  0.98%    1.20%    mov    0xc(%r12,%r11,8),%eax   ; Coder0.abstractWork() body, arraylength
 42.14%   41.75%    add    $0x10,%rsp              ; epilogue and return
  0.01%             pop    %rbp
  0.01%             test   %eax,0x15ed3f4e(%rip)
  4.77%    3.80%    retq

                  SLOW_PATH:
  0.56%    0.66%    mov    %r11,%rdx
  0.22%    0.20%    shl    $0x3,%rdx
  0.43%    0.35%    lea    (%r12,%r10,8),%rsi
                    xchg   %ax,%ax
  0.01%    0.01%    mov    $0xffffffffffffffff,%rax
  0.21%    0.19%    callq  0x00007f564c3c3220     ; VIRTUAL CALL TO abstractWork()

…​and no distinct Coder0::abstractWork in profile:

....[Hottest Methods (after inlining)]..............................................................
 70.49%   73.34%  net.shipilev.three.Data::do_Dynamic_Abstract_Ref
 21.47%   20.44%  net.shipilev.three.generated.Three_dynamic_Abstract_Ref::dynamic_Abstract_Ref_avgt_jmhStub
  2.12%    1.22%  net.shipilev.three.Coder1::abstractWork
  2.05%    1.25%  net.shipilev.three.Coder2::abstractWork
  2.04%    2.12%  java.util.concurrent.ConcurrentHashMap::tabAt

5.3.2. C1

Benchmark                    (count)  (p1)  (p2)  (p3)  Mode  Cnt    Score   Error  Units
Three.dynamic_Abstract_Ref     10000     1     1     1  avgt   50  138.515 ± 0.595  us/op
Three.dynamic_Interface_Ref    10000     1     1     1  avgt   50  153.387 ± 0.184  us/op
Three.static_ID_ifElse         10000     1     1     1  avgt   50   91.239 ± 0.170  us/op
Three.static_ID_switch         10000     1     1     1  avgt   50   91.631 ± 0.821  us/op

Three.dynamic_Abstract_Ref     10000    18     1     1  avgt   50   72.798 ± 0.205  us/op
Three.dynamic_Interface_Ref    10000    18     1     1  avgt   50   88.188 ± 0.180  us/op
Three.static_ID_ifElse         10000    18     1     1  avgt   50   54.326 ± 0.393  us/op
Three.static_ID_switch         10000    18     1     1  avgt   50   54.454 ± 0.081  us/op

Three.dynamic_Abstract_Ref     10000    38     1     1  avgt   50   67.391 ± 0.928  us/op
Three.dynamic_Interface_Ref    10000    38     1     1  avgt   50   81.523 ± 0.192  us/op
Three.static_ID_ifElse         10000    38     1     1  avgt   50   50.885 ± 0.064  us/op
Three.static_ID_switch         10000    38     1     1  avgt   50   50.374 ± 0.154  us/op

The static cases behave the same as with C2. The dynamic cases are not inlined regardless of the profile information: we have already seen that C1 does not inline even bimorphic calls, and naturally it does not inline megamorphic ones either. The difference between the distributions is explained by branch prediction, either in the VM stubs or in our own static dispatchers.

5.4. Cheating The Runtime

We have already cheated with the static cases in this section, but remember the instanceof trick. We know that doing a manual instanceof is akin to doing the VM's work. Can we play God here, and peel the first coder off under an instanceof check? In other words, do something like this:

public int do_Peel_Interface_Static() {
    if (coder instanceof Coder0) {
        return Coder0.staticWork(data); // hottest coder peeled off to a static call
    } else {
        return coder.work(data);        // interface call handles the rest
    }
}

public int do_Peel_Interface_Interface() {
    // Both branches are deliberately identical: peeling still splits one
    // megamorphic call site into two call sites with distinct type profiles.
    if (coder instanceof Coder0) {
        return coder.work(data);
    } else {
        return coder.work(data);
    }
}
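The abstract-class peel variants that appear in the tables below (peel_Abstract_Static, peel_Abstract_Abstract) are not spelled out above; they presumably mirror the interface versions. A self-contained sketch, where every class and method name is a reconstruction from the benchmark labels rather than the actual benchmark source:

```java
// Hedged reconstruction of the abstract-class peel variants; all names
// are assumptions modeled on the benchmark labels.
abstract class AbstractCoder {
    abstract int abstractWork(byte[] data);
}
class Coder0 extends AbstractCoder {
    static int staticWork(byte[] data) { return data.length; }
    int abstractWork(byte[] data)      { return data.length; }
}
class Coder1 extends AbstractCoder {
    int abstractWork(byte[] data)      { return data.length; }
}
class Coder2 extends AbstractCoder {
    int abstractWork(byte[] data)      { return data.length; }
}

class PeelSketch {
    AbstractCoder abstractCoder;
    byte[] data;

    PeelSketch(AbstractCoder c, byte[] d) { abstractCoder = c; data = d; }

    // peel_Abstract_Static analogue: hottest type peeled to a static call.
    int do_Peel_Abstract_Static() {
        if (abstractCoder instanceof Coder0) {
            return Coder0.staticWork(data);          // no dispatch at all
        } else {
            return abstractCoder.abstractWork(data); // virtual call for the rest
        }
    }

    // peel_Abstract_Abstract analogue: both branches make the same virtual
    // call, but each call site now collects its own type profile.
    int do_Peel_Abstract_Abstract() {
        if (abstractCoder instanceof Coder0) {
            return abstractCoder.abstractWork(data);
        } else {
            return abstractCoder.abstractWork(data);
        }
    }
}
```

As with the interface variants, the point of the second method is not the branch bodies, but splitting the type profile between the two call sites.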

5.4.1. C2

Benchmark                       (count)  (p1)  (p2)  (p3)  Mode  Cnt    Score   Error  Units

; equidistributed
Three.dynamic_Abstract_Ref        10000     1     1     1  avgt   50  138.611 ± 2.524  us/op
Three.dynamic_Interface_Ref       10000     1     1     1  avgt   50  155.505 ± 3.027  us/op
Three.static_ID_ifElse            10000     1     1     1  avgt   50   99.868 ± 2.026  us/op
Three.static_ID_switch            10000     1     1     1  avgt   50   98.905 ± 0.600  us/op

Three.peel_Abstract_Abstract      10000     1     1     1  avgt   50  105.511 ± 1.733  us/op
Three.peel_Abstract_Static        10000     1     1     1  avgt   50  105.814 ± 1.177  us/op
Three.peel_Interface_Interface    10000     1     1     1  avgt   50  105.841 ± 1.694  us/op
Three.peel_Interface_Static       10000     1     1     1  avgt   50  105.594 ± 1.151  us/op

; biased towards coder0 at 90%
Three.dynamic_Abstract_Ref        10000    18     1     1  avgt   50   79.674 ± 2.423  us/op
Three.dynamic_Interface_Ref       10000    18     1     1  avgt   50   90.016 ± 0.940  us/op
Three.static_ID_ifElse            10000    18     1     1  avgt   50   57.578 ± 1.079  us/op
Three.static_ID_switch            10000    18     1     1  avgt   50   55.561 ± 0.169  us/op

Three.peel_Abstract_Abstract      10000    18     1     1  avgt   50   63.588 ± 2.789  us/op
Three.peel_Abstract_Static        10000    18     1     1  avgt   50   63.164 ± 1.655  us/op
Three.peel_Interface_Interface    10000    18     1     1  avgt   50   64.309 ± 2.956  us/op
Three.peel_Interface_Static       10000    18     1     1  avgt   50   62.016 ± 0.185  us/op

; biased towards coder0 at 95%
Three.dynamic_Abstract_Ref        10000    38     1     1  avgt   50   58.465 ± 0.160  us/op
Three.dynamic_Interface_Ref       10000    38     1     1  avgt   50   58.335 ± 0.209  us/op
Three.static_ID_ifElse            10000    38     1     1  avgt   50   51.692 ± 1.070  us/op
Three.static_ID_switch            10000    38     1     1  avgt   50   51.860 ± 0.156  us/op

Three.peel_Abstract_Abstract      10000    38     1     1  avgt   50   57.915 ± 0.174  us/op
Three.peel_Abstract_Static        10000    38     1     1  avgt   50   58.093 ± 0.263  us/op
Three.peel_Interface_Interface    10000    38     1     1  avgt   50   58.642 ± 1.596  us/op
Three.peel_Interface_Static       10000    38     1     1  avgt   50   57.859 ± 0.198  us/op

What do we see here? Well, it turns out that in the truly megamorphic case of 1/1/1, peeling the first coder is clearly profitable. It is profitable with the skewed distribution of 18/1/1 as well. And it stops being profitable at the very skewed distribution of 38/1/1. Can you already guess why? You should have learned enough by now to provide a sound hypothesis. If you can, then my work here is done.

So, if we look into peel_Interface_Interface at 1/1/1:

                  [Verified Entry Point]
  0.23%    0.20%    mov    %eax,-0x14000(%rsp)
  2.14%    2.45%    push   %rbp
  0.11%    0.06%    sub    $0x20,%rsp
  0.20%    0.54%    mov    0x14(%rsi),%r11d        ; get field $coder
 19.68%   21.62%    mov    0x8(%r12,%r11,8),%r10d  ; load $coder.<classword>
  7.74%   10.11%    mov    0x10(%rsi),%ebp         ; get field $data
  0.17%    0.15%    cmp    $0xf8010426,%r10d       ; check if $coder == Coder0
                    jne    OTHER_CODERS            ;   jump further if not
  5.95%    8.92%    mov    0xc(%r12,%rbp,8),%eax   ; Coder0.work() body, arraylength

                  EPILOG:
  9.35%    8.89%    add    $0x20,%rsp              ; epilogue and return
  0.10%    0.05%    pop    %rbp
  2.89%    1.11%    test   %eax,0x15f1f36f(%rip)
  2.16%    1.32%    retq

                  OTHER_CODERS:
  2.63%    3.82%    cmp    $0xf80125c8,%r10d       ; check if $coder == Coder2
                    je     CODER_2                 ;   jump over if so
  3.53%    4.68%    cmp    $0xf8012585,%r10d       ; check if $coder == Coder1
                    jne    SLOW_PATH               ;   jump to slow path if not
  2.38%    2.17%    mov    0xc(%r12,%rbp,8),%eax   ; Coder1.work() body, arraylength
 10.51%   10.05%    jmp    EPILOG                  ; jump back to epilogue

                  CODER_2:
  2.92%    4.08%    mov    0xc(%r12,%rbp,8),%eax   ; Coder2.work() body, arraylength
  5.34%    5.70%    jmp    EPILOG                  ; jump back to epilogue

                  SLOW_PATH:
                    mov    $0xffffffc6,%esi        ; slow path
                    mov    %r11d,(%rsp)
                    callq  0x00007fb38cdea1a0

See the magic here? The call on the else branch inlined both remaining coders. Note that this works even if we leave the virtual/interface call under the check, because now we have two call sites, each with a distinct type profile. The first call site is inlined as monomorphic, and its type check is merged with the original instanceof. The second call site is now effectively bimorphic, and enjoys bimorphic inlining.

5.4.2. C1

Benchmark                      (count)  (p1)  (p2)  (p3)  Mode  Cnt    Score   Error  Units

; equidistributed
Three.dynamic_Abstract_Ref       10000     1     1     1  avgt   50  138.515 ± 0.595  us/op
Three.dynamic_Interface_Ref      10000     1     1     1  avgt   50  153.387 ± 0.184  us/op
Three.static_ID_ifElse           10000     1     1     1  avgt   50   91.239 ± 0.170  us/op
Three.static_ID_switch           10000     1     1     1  avgt   50   91.631 ± 0.821  us/op

Three.peel_Abstract_Abstract     10000     1     1     1  avgt   50  146.658 ± 4.189  us/op
Three.peel_Abstract_Static       10000     1     1     1  avgt   50  136.430 ± 0.507  us/op
Three.peel_Interface_Interface   10000     1     1     1  avgt   50  156.816 ± 0.440  us/op
Three.peel_Interface_Static      10000     1     1     1  avgt   50  148.564 ± 0.541  us/op

; biased towards coder0 at 90%
Three.dynamic_Abstract_Ref       10000    18     1     1  avgt   50   72.798 ± 0.205  us/op
Three.dynamic_Interface_Ref      10000    18     1     1  avgt   50   88.188 ± 0.180  us/op
Three.static_ID_ifElse           10000    18     1     1  avgt   50   54.326 ± 0.393  us/op
Three.static_ID_switch           10000    18     1     1  avgt   50   54.454 ± 0.081  us/op

Three.peel_Abstract_Abstract     10000    18     1     1  avgt   50   83.206 ± 0.257  us/op
Three.peel_Abstract_Static       10000    18     1     1  avgt   50   65.073 ± 0.731  us/op
Three.peel_Interface_Interface   10000    18     1     1  avgt   50   87.273 ± 0.105  us/op
Three.peel_Interface_Static      10000    18     1     1  avgt   50   66.372 ± 0.121  us/op

; biased towards coder0 at 95%
Three.dynamic_Abstract_Ref       10000    38     1     1  avgt   50   67.391 ± 0.928  us/op
Three.dynamic_Interface_Ref      10000    38     1     1  avgt   50   81.523 ± 0.192  us/op
Three.static_ID_ifElse           10000    38     1     1  avgt   50   50.885 ± 0.064  us/op
Three.static_ID_switch           10000    38     1     1  avgt   50   50.374 ± 0.154  us/op

Three.peel_Abstract_Abstract     10000    38     1     1  avgt   50   76.414 ± 1.011  us/op
Three.peel_Abstract_Static       10000    38     1     1  avgt   50   58.389 ± 1.493  us/op
Three.peel_Interface_Interface   10000    38     1     1  avgt   50   81.647 ± 4.046  us/op
Three.peel_Interface_Static      10000    38     1     1  avgt   50   59.273 ± 0.848  us/op

Unfortunately, this trick does not help much on C1. In the absence of profile-assisted inlining, there is no relief from reducing the type profile to two types. Again, interested readers are welcome to disassemble this case.

5.5. Discussion III

Looking at the case of three coders, we can spot a few things:

  1. Even if static analysis shows there are three possible call targets, compilers can use the dynamic type profile to speculatively optimize for the exact receivers. C2 does this routinely in monomorphic and bimorphic cases, and sometimes in megamorphic cases when there is a clear winner. C1 does not inline megamorphic calls.

  2. The (absence of) megamorphic inlining hurts at least the nano-benchmarks. There seem to be three ways to avoid it: peel off the hottest type with manual checks, decrease TypeProfileMajorReceiverPercent to let the VM figure out and inline the most frequent target, or do static dispatch over the known targets. Megamorphic inlining would need to be addressed one way or the other on the VM side, since it potentially affects lots of modern workloads.

Conclusion

DO NOT TRY TO OPTIMIZE LOW-LEVEL STUFF UNLESS YOU REALLY HAVE TO.
PERFORMANCE ADVICE MAY BE HAZARDOUS FOR GENERAL PERFORMANCE, MAINTAINABILITY, AND SANITY. IT IS ALSO KNOWN TO HAVE A VERY LIMITED SHELF LIFE, AND CAUSE HEADACHES WHEN NOT USED CAREFULLY. SEEK YOUR PERFORMANCE ENGINEER’S APPROVAL BEFORE USING.
  1. You should not normally bother with method dispatch performance. The optimizing compiler that produces the final code for the hottest methods is able to inline most virtual calls. The remaining points are valid only once you have identified method dispatch/inlining as your actual performance problem.

  2. You should actually care about the inlineability of the target method, e.g. its size, modifiers, etc. In this post, we have ignored this aspect, since we were using tiny trivial methods. However, if the target method cannot be successfully devirtualized, the inlining will not happen. Inlining actually broadens the scope of other optimizations, and that alone is, in many cases, reason enough to inline. With very thin nanobenchmarks, that advantage cannot be properly quantified.

  3. Class Hierarchy Analysis is able to statically figure out that there is only a single subclass of a given class, even in the absence of profile information. Take care when you are adding more subclasses to an otherwise lone super-class in the hierarchy: CHA can fail then, and invalidate your previous performance assumptions.

  4. When CHA fails, C1 inlining for virtual and interface calls also fails, since type profile is not available.

  5. Even when CHA fails, monomorphic and bimorphic calls are routinely inlined by C2. Morphicity is derived from the runtime type profile collected by the interpreter or C1.

  6. Megamorphic calls are very bad; neither C1 nor C2 can inline them. At this point, there seem to be three ways to avoid them: peel off the hottest type with manual checks, decrease TypeProfileMajorReceiverPercent to let the VM figure out and inline the most frequent target, or do static dispatch over the known targets. The VM might do better for these cases in the future, though.

  7. If you have a simple method implementation that does not depend on instance state, it is a good idea to make it static, for both maintainability and performance reasons. Carrying around a class instance just to make a virtual call off it makes life harder for the runtime.

  8. When you need peak performance for method dispatch, you may choose to manually dispatch over the static implementations. This goes against the usual development practice, but then quite a few low-level performance hacks take the same diversion route. This is what we are going to do in String Compression work, especially given String is already final, and adding a reference field to String is worse for footprint than adding a byte ID.

  9. If you are choosing between interfaces and abstract classes, interfaces should not be your choice. Unoptimized interface calls are a burden. But, if you have to care about this difference, it means the profile-based de-virtualization and inlining did not happen, and you are probably already screwed.

  10. Mispredicted branches are a killer. If you want to rigorously quantify the effects of low-level hacks like this, you have to try different source data mixes to exploit different branch prediction behaviors, or pessimistically choose the most unfavorable one.