About

"JVM Anatomy Park" is an on-going mini-post series, where every post is slated to take 5-10 minutes to read. As such, each post goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here are anecdotal, and not actually reviewed for errors, consistency, writing style, or syntactic and semantic mistakes. Use and/or trust this at your own risk.

Aleksey Shipilёv, Performance Geek @ Red Hat OpenJDK Team
Shout out at Twitter: @shipilev
Questions, comments, suggestions: aleksey@shipilev.net

Question

I have heard megamorphic virtual calls are so bad, they end up being handled by the interpreter, not the optimizing compiler. Is that true?

Theory

If you have read numerous articles about virtual call optimization in HotSpot, you may have been left with the belief that megamorphic calls are pure evil, because they invoke the slow-path handling and do not enjoy compiler optimizations. If you try to comprehend what OpenJDK does when it fails to devirtualize a call, you might think that it crashes and burns performance-wise. But consider that JVMs work decently well even with baseline compilers, and in some cases even interpreter performance is okay (and it matters for time-to-performance).

So, would it be premature to conclude that the runtime just gives up?

Practice

Let us try and see how the virtual call slow path looks. For that, we construct an artificial megamorphic call site in a JMH benchmark, with three subclasses visiting the same call site:

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class VirtualCall {

    static abstract class A {
        int c1, c2, c3;
        public abstract void m();
    }

    static class C1 extends A {
        public void m() { c1++; }
    }
    static class C2 extends A {
        public void m() { c2++; }
    }
    static class C3 extends A {
        public void m() { c3++; }
    }

    A[] as;

    @Param({"mono", "mega"})
    private String mode;

    @Setup
    public void setup() {
        as = new A[300];
        boolean mega = mode.equals("mega");
        for (int c = 0; c < 300; c += 3) {
            as[c]   = new C1();
            as[c+1] = mega ? new C2() : new C1();
            as[c+2] = mega ? new C3() : new C1();
        }
    }

    @Benchmark
    public void test() {
        for (A a : as) {
            a.m();
        }
    }
}

To simplify the analysis, we invoke this with -XX:LoopUnrollLimit=1 -XX:-TieredCompilation: the first option keeps loop unrolling from complicating the disassembly, and disabling tiered compilation guarantees compilation with the final optimizing compiler. We do not care about the performance numbers all that much, but let us have them to frame the discussion:

Benchmark         (mode)  Mode  Cnt     Score    Error  Units
VirtualCall.test    mono  avgt    5   325.478 ± 18.156  ns/op
VirtualCall.test    mega  avgt    5  1070.304 ± 53.910  ns/op

To give you a taste of what would happen if we did not use the optimizing compiler on test, run with -XX:CompileCommand=exclude,org.openjdk.VirtualCall::test:

Benchmark         (mode)  Mode  Cnt      Score     Error  Units
VirtualCall.test    mono  avgt    5  11598.390 ± 535.593  ns/op
VirtualCall.test    mega  avgt    5  11787.686 ± 884.384  ns/op

So, the megamorphic call does indeed cost something, but it is definitely not interpreter-bad performance. The difference between "mono" and "mega" in the optimized case is basically the call overhead: we spend around 3.5 ns per element in the "mega" case, while spending only around 1 ns per element in the "mono" case.
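As a back-of-the-envelope sanity check on those per-element figures (a standalone sketch; the scores are copied from the runs above, and the array length is the 300 from the benchmark setup):

```java
public class CallCost {
    public static void main(String[] args) {
        int elements = 300;          // array length in the benchmark setup
        double mono = 325.478;       // ns/op, "mono" score from the run above
        double mega = 1070.304;      // ns/op, "mega" score from the run above

        System.out.printf("mono: %.2f ns/element%n", mono / elements); // → 1.08
        System.out.printf("mega: %.2f ns/element%n", mega / elements); // → 3.57
        System.out.printf("call overhead: %.2f ns/element%n",
                          (mega - mono) / elements);                   // → 2.48
    }
}
```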

What does the "mega" case look like in perfasm? Like this, with many things pruned for brevity:

....[Hottest Region 1].......................................................................
C2, org.openjdk.generated.VirtualCall_test_jmhTest::test_avgt_jmhStub, version 88 (143 bytes)

  6.93%    5.40%  ↗  0x...5c450: mov    0x40(%rsp),%r9
                  │  ...
  3.65%    4.31%  │  0x...5c47b: callq  0x...0bf60 ;*invokevirtual m
                  │                            ; - org.openjdk.VirtualCall::test@22 (line 76)
                  │                            ;   {virtual_call}
  3.12%    2.34%  │  0x...5c480: inc    %ebp
  3.33%    0.02%  │  0x...5c482: cmp    0x10(%rsp),%ebp
                  ╰  0x...5c486: jl     0x...5c450
                     ...
.............................................................................................
 31.26%   21.77%  <total for region 1>

....[Hottest Region 2].......................................................................
C2, org.openjdk.VirtualCall$C1::m, version 84 (14 bytes) <--- mis-attributed :(

                     ...
                   Decoding VtableStub vtbl[5]@12
  3.95%    1.57%     0x...59bf0: mov    0x8(%rsi),%eax
  3.73%    3.34%     0x...59bf3: shl    $0x3,%rax
  3.73%    5.04%     0x...59bf7: mov    0x1d0(%rax),%rbx
 16.45%   22.42%     0x...59bfe: jmpq   *0x40(%rbx)        ; jump to target
                     0x...59c01: add    %al,(%rax)
                     0x...59c03: add    %al,(%rax)
                     ...
.............................................................................................
 27.87%   32.37%  <total for region 2>

....[Hottest Region 3].......................................................................
C2, org.openjdk.VirtualCall$C3::m, version 86 (26 bytes)

# {method} {0x00007f75aaf4dd50} 'm' '()V' in 'org/openjdk/VirtualCall$C3'

                    ...
                  [Verified Entry Point]
 17.82%   26.04%    0x...595c0: sub    $0x18,%rsp
  0.06%    0.04%    0x...595c7: mov    %rbp,0x10(%rsp)
                    0x...595cc: incl   0x14(%rsi)       ; c3++
  3.53%    5.14%    0x...595cf: add    $0x10,%rsp
                    0x...595d3: pop    %rbp
  3.29%    5.10%    0x...595d4: test   %eax,0x9f01a26(%rip)
  0.02%    0.02%    0x...595da: retq
                    ...
.............................................................................................
 24.73%   36.35%  <total for region 3>

So the benchmarking loop calls into something, which we can assume is the virtual call handler, and that eventually ends up in a VtableStub, which supposedly does what every other runtime does with virtual calls: jumps to the actual method with the help of the Virtual Method Table (VMT).[1]
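What the VtableStub does in three machine instructions can be modeled with a toy Java sketch (illustrative only, all names here are made up, this is not JVM internals): load the receiver's class, index that class's method table at a slot fixed for m at class-load time, and jump to the target:

```java
// Toy model of Virtual Method Table (VMT) dispatch. In the real stub:
// "mov 0x8(%rsi),%eax" loads the receiver's class word,
// "mov 0x1d0(%rax),%rbx" indexes the class's vtable at a fixed slot,
// "jmpq *0x40(%rbx)" jumps to the resolved method entry.
public class VtableModel {
    interface Target { void invoke(Receiver r); }

    static class Klass {
        final Target[] vtable;        // per-class table of method targets
        Klass(Target... vtable) { this.vtable = vtable; }
    }

    static class Receiver {
        final Klass klass;            // analogous to the object's class word
        int c;
        Receiver(Klass klass) { this.klass = klass; }
    }

    // a.m() with m at a known vtable slot: class load, table index, call
    static void virtualCall(Receiver a, int slot) {
        a.klass.vtable[slot].invoke(a);
    }

    public static void main(String[] args) {
        Klass c1 = new Klass(r -> r.c += 1);    // like C1.m
        Klass c2 = new Klass(r -> r.c += 100);  // like C2.m
        Receiver r1 = new Receiver(c1), r2 = new Receiver(c2);
        virtualCall(r1, 0);
        virtualCall(r2, 0);
        System.out.println(r1.c + " " + r2.c);  // → 1 100
    }
}
```

The same call site reaches different method bodies purely through the per-class table, which is why the stub itself never needs to know how many receiver classes exist.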

But wait a minute, this does not compute! The disassembly says we are actually calling to 0x…​0bf60, not into the VtableStub that is at 0x…​59bf0?! And that call is hot, so the call target should also be hot, right? This is where the runtime itself plays tricks on us. Even if the compiler fails to optimize the virtual call, the runtime can handle the "pessimistic" cases on its own. To diagnose this better, we need to get a fastdebug OpenJDK build, and supply a tracing option for Inline Caches (IC): -XX:+TraceIC. Additionally, we want to save the HotSpot log to a file with -prof perfasm:saveLog=true.

Lo and behold!

$ grep IC org.openjdk.VirtualCall.test-AverageTime.log
    IC@0x00007fac4fcb428b: to megamorphic {method} {0x00007fabefa81880} 'm' ()V';
                                 in 'org/openjdk/VirtualCall$C2'; entry: 0x00007fac4fcb2ab0

Okay, it says the inline cache had acted for the call site at 0x00007fac4fcb428b. Who is that? It is our Java call!

$ grep -A 4 0x00007fac4fcb428b: org.openjdk.VirtualCall.test-AverageTime.log
   0.02%    0x00007fac4fcb428b: callq  0x00007fac4fb7dda0
                                  ;*invokevirtual m {reexecute=0 rethrow=0 return_oop=0}
                                  ; - org.openjdk.VirtualCall::test@22 (line 76)
                                  ;   {virtual_call}

But what was the target address in that Java call? It is the resolving runtime stub:

$ grep -C 2  0x00007fac4fb7dda0 org.openjdk.VirtualCall.test-AverageTime.log
                    0x00007fac4fb7dcdf: hlt
                  Decoding RuntimeStub - resolve_virtual_call 0x00007fac4fb7dd10
                    0x00007fac4fb7dda0: push   %rbp
                    0x00007fac4fb7dda1: mov    %rsp,%rbp
                    0x00007fac4fb7dda4: pushfq

This stub basically calls into the runtime, figures out which method we want to call, and then asks the IC to patch the call site to point to the newly resolved address! Since that is a one-time action, no wonder we do not see it as hot code. The IC action line mentions changing the entry to another address, which is, by the way, our actual VtableStub:

$ grep -C 4 0x00007fac4fcb2ab0: org.openjdk.VirtualCall.test-AverageTime.log
                  Decoding VtableStub vtbl[5]@12
  8.94%    6.49%    0x00007fac4fcb2ab0: mov    0x8(%rsi),%eax
  0.16%    0.06%    0x00007fac4fcb2ab3: shl    $0x3,%rax
  0.20%    0.10%    0x00007fac4fcb2ab7: mov    0x1e0(%rax),%rbx
  2.34%    1.90%    0x00007fac4fcb2abe: jmpq   *0x40(%rbx)
                    0x00007fac4fcb2ac1: int3

In the end, no runtime/compiler calls are needed to dispatch over the resolved call: the call site just calls the VtableStub that does the VMT dispatch, never leaving the generated machine code. This IC machinery handles virtual monomorphic and static calls in a similar way, pointing to a stub/address that does not do the VMT dispatch.
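The patching sequence we just traced can be modeled with another toy Java sketch (illustrative only, names are made up; real HotSpot ICs patch machine-code call sites, not objects): the call site starts out pointing at the resolving stub, the first call patches it to a direct target, and seeing a second receiver class patches it again to the vtable-style stub:

```java
// Toy model of inline cache (IC) patching:
// resolving stub -> monomorphic direct call -> megamorphic vtable dispatch.
public class InlineCacheModel {
    interface Stub { String call(Object receiver); }

    // Megamorphic path: dispatch through the receiver's class, like VtableStub.
    static String vtableDispatch(Object r) {
        return "vtable:" + r.getClass().getSimpleName();
    }

    static class CallSite {
        Class<?> cached;                          // receiver class seen first
        Stub target = this::resolveAndPatch;      // initial state: resolver

        String call(Object r) { return target.call(r); }

        // One-time runtime action: resolve and patch to a direct call.
        String resolveAndPatch(Object r) {
            cached = r.getClass();
            target = this::monomorphic;
            return "resolved:" + cached.getSimpleName();
        }

        // Monomorphic state: fast path while only one class shows up.
        String monomorphic(Object r) {
            if (r.getClass() == cached) return "direct:" + cached.getSimpleName();
            target = InlineCacheModel::vtableDispatch;  // go megamorphic
            return vtableDispatch(r);
        }
    }

    public static void main(String[] args) {
        CallSite cs = new CallSite();
        System.out.println(cs.call("s"));   // → resolved:String
        System.out.println(cs.call("s"));   // → direct:String
        System.out.println(cs.call(42));    // → vtable:Integer
        System.out.println(cs.call("t"));   // → vtable:String (stays megamorphic)
    }
}
```

This also shows why the resolving stub never shows up as hot: it runs once per call site, after which the patched target handles every subsequent call.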

What we saw in the initial JMH perfasm output is the generated code as it looked right after compilation, but before execution and potential runtime patching.[2]

Observations

Just because the compiler has failed to optimize for the best case, it does not mean the worst case is abysmally worse. True, you give up some optimizations, but the overhead is not so devastating that you need to avoid virtual calls altogether. This rhymes with the "Black Magic of (Java) Method Dispatch" conclusion: unless you care very much, you do not have to worry about call performance.


1. Interface calls would be handled similarly, but with a twist during resolution and invocation in the stub.
2. Yet another instance of "Profilers Are Lying Hobbitses (and we hate them!)"