About, Disclaimers, Contacts
"JVM Anatomy Quarks" is an ongoing mini-post series, where every post describes some elementary piece of knowledge about the JVM. The name underlines the fact that no single post can be taken in isolation: most pieces described here readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep for only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic mistakes, or duplicates. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net
Question
I have heard megamorphic virtual calls are so bad, they are getting called by interpreter, not optimizing compiler. Is that true?
Theory
If you have read numerous articles about virtual call optimization in Hotspot, you may have been left with the belief that megamorphic calls are pure evil, because they invoke the slowpath handling and do not enjoy compiler optimizations. If you try to comprehend what OpenJDK does when it fails to devirtualize a call, you might think that performance crashes and burns. But consider that JVMs work decently well even with baseline compilers, and in some cases even interpreter performance is okay (and it matters for time-to-performance).
So, would it be premature to conclude that the runtime just gives up?
Practice
Let us try and see how the virtual call slowpath looks. For that, we make an artificial megamorphic call site in a JMH benchmark: we have three subclasses visiting the same call site:
import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class VirtualCall {

    static abstract class A {
        int c1, c2, c3;
        public abstract void m();
    }

    static class C1 extends A {
        public void m() { c1++; }
    }

    static class C2 extends A {
        public void m() { c2++; }
    }

    static class C3 extends A {
        public void m() { c3++; }
    }

    A[] as;

    @Param({"mono", "mega"})
    private String mode;

    @Setup
    public void setup() {
        as = new A[300];
        boolean mega = mode.equals("mega");
        for (int c = 0; c < 300; c += 3) {
            as[c]   = new C1();
            as[c+1] = mega ? new C2() : new C1();
            as[c+2] = mega ? new C3() : new C1();
        }
    }

    @Benchmark
    public void test() {
        for (A a : as) {
            a.m();
        }
    }
}
To simplify things for analysis, we invoke this with -XX:LoopUnrollLimit=1 -XX:-TieredCompilation: this blocks loop unrolling from complicating the disassembly, and disabling tiered compilation guarantees compilation with the final optimizing compiler. We do not care about the performance numbers all that much, but let us have them to frame the discussion:
Benchmark (mode) Mode Cnt Score Error Units
VirtualCall.test mono avgt 5 325.478 ± 18.156 ns/op
VirtualCall.test mega avgt 5 1070.304 ± 53.910 ns/op
To give you some taste of what would happen if we did not use the optimizing compiler on test, run with -XX:CompileCommand=exclude,org.openjdk.VirtualCall::test:
Benchmark (mode) Mode Cnt Score Error Units
VirtualCall.test mono avgt 5 11598.390 ± 535.593 ns/op
VirtualCall.test mega avgt 5 11787.686 ± 884.384 ns/op
So, the megamorphic call does indeed cost something, but it is definitely not interpreter-bad performance. The difference between "mono" and "mega" in the optimized case is basically the call overhead: we spend around 3 ns per element in the "mega" case, while spending only around 1 ns per element in the "mono" case.
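As a sanity check on that arithmetic, the per-element costs follow directly from dividing the average times in the table above by the 300 array elements (the class and variable names below are just for illustration):

```java
import java.util.Locale;

public class PerElementCost {
    public static void main(String[] args) {
        int elements = 300;       // array size from the benchmark setup
        double monoNs = 325.478;  // ns/op, "mono" row from the results above
        double megaNs = 1070.304; // ns/op, "mega" row from the results above
        System.out.println(String.format(Locale.ROOT, "mono: %.1f ns/element", monoNs / elements)); // 1.1
        System.out.println(String.format(Locale.ROOT, "mega: %.1f ns/element", megaNs / elements)); // 3.6
    }
}
```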
What does the "mega" case look like in perfasm? Like this, with many things pruned for brevity:
....[Hottest Region 1].......................................................................
C2, org.openjdk.generated.VirtualCall_test_jmhTest::test_avgt_jmhStub, version 88 (143 bytes)
6.93% 5.40% ↗ 0x...5c450: mov 0x40(%rsp),%r9
│ ...
3.65% 4.31% │ 0x...5c47b: callq 0x...0bf60 ;*invokevirtual m
│ ; - org.openjdk.VirtualCall::test@22 (line 76)
│ ; {virtual_call}
3.12% 2.34% │ 0x...5c480: inc %ebp
3.33% 0.02% │ 0x...5c482: cmp 0x10(%rsp),%ebp
╰ 0x...5c486: jl 0x...5c450
...
.............................................................................................
31.26% 21.77% <total for region 1>
....[Hottest Region 2].......................................................................
C2, org.openjdk.VirtualCall$C1::m, version 84 (14 bytes) <--- mis-attributed :(
...
Decoding VtableStub vtbl[5]@12
3.95% 1.57% 0x...59bf0: mov 0x8(%rsi),%eax
3.73% 3.34% 0x...59bf3: shl $0x3,%rax
3.73% 5.04% 0x...59bf7: mov 0x1d0(%rax),%rbx
16.45% 22.42% 0x...59bfe: jmpq *0x40(%rbx) ; jump to target
0x...59c01: add %al,(%rax)
0x...59c03: add %al,(%rax)
...
.............................................................................................
27.87% 32.37% <total for region 2>
....[Hottest Region 3].......................................................................
C2, org.openjdk.VirtualCall$C3::m, version 86 (26 bytes)
# {method} {0x00007f75aaf4dd50} 'm' '()V' in 'org/openjdk/VirtualCall$C3'
...
[Verified Entry Point]
17.82% 26.04% 0x...595c0: sub $0x18,%rsp
0.06% 0.04% 0x...595c7: mov %rbp,0x10(%rsp)
0x...595cc: incl 0x14(%rsi) ; c3++
3.53% 5.14% 0x...595cf: add $0x10,%rsp
0x...595d3: pop %rbp
3.29% 5.10% 0x...595d4: test %eax,0x9f01a26(%rip)
0.02% 0.02% 0x...595da: retq
...
.............................................................................................
24.73% 36.35% <total for region 3>
So the benchmarking loop calls into something which we can assume is the virtual call handler, and then it ends up in a VtableStub, which supposedly does what every other runtime does with virtual calls: jumps to the actual method with the help of the Virtual Method Table (VMT).[1]
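To make the three-instruction VtableStub concrete, here is a toy plain-Java model of table-based dispatch; the class ids, tables and operations are made up for illustration and are not Hotspot internals. The idea is the same, though: the call site loads the receiver's class, fetches that class's method table, and jumps through the fixed slot assigned to m().

```java
// Toy model of VMT dispatch (illustrative names, not Hotspot code):
// every class carries a table of code pointers; a virtual call indexes
// the receiver class's table at the method's fixed slot and jumps there.
public class VmtSketch {
    interface Code { int apply(int x); }

    // "vtables": one table per class id; slot 0 holds that class's m()
    static final Code[][] VTABLES = {
        { x -> x + 1 },  // class id 0: its version of m()
        { x -> x * 2 },  // class id 1: its version of m()
        { x -> x - 3 },  // class id 2: its version of m()
    };

    // mirrors the VtableStub sequence: load class id, fetch the class's
    // table, jump through the fixed slot for the method
    static int dispatch(int classId, int x) {
        return VTABLES[classId][0].apply(x);
    }

    public static void main(String[] args) {
        System.out.println(dispatch(0, 10)); // 11
        System.out.println(dispatch(1, 10)); // 20
        System.out.println(dispatch(2, 10)); // 7
    }
}
```

The cost structure matches what the benchmark shows: every call pays for the extra loads and the indirect jump, regardless of which subclass shows up.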
But wait a minute, this does not compute! The disassembly says we are actually calling to 0x…0bf60, not into the VtableStub that is at 0x…59bf0?! And that call is hot, so the call target should also be hot, right? This is where the runtime itself plays tricks on us. Even if the compiler bails out of optimizing the virtual call, the runtime can handle the "pessimistic" cases on its own. To diagnose this better, we need to get a fastdebug OpenJDK build and supply a tracing option for Inline Caches (IC): -XX:+TraceIC. Additionally, we want to save the Hotspot log to a file with -prof perfasm:saveLog=true.
Lo and behold!
$ grep IC org.openjdk.VirtualCall.test-AverageTime.log
IC@0x00007fac4fcb428b: to megamorphic {method} {0x00007fabefa81880} 'm' '()V';
in 'org/openjdk/VirtualCall$C2'; entry: 0x00007fac4fcb2ab0
Okay, it says the inline cache acted for the call site at 0x00007fac4fcb428b. What is at that address? It is our Java call!
$ grep -A 4 0x00007fac4fcb428b: org.openjdk.VirtualCall.test-AverageTime.log
0.02% 0x00007fac4fcb428b: callq 0x00007fac4fb7dda0
;*invokevirtual m {reexecute=0 rethrow=0 return_oop=0}
; - org.openjdk.VirtualCall::test@22 (line 76)
; {virtual_call}
But what is at the address that the Java call targets? This is the resolving runtime stub:
$ grep -C 2 0x00007fac4fb7dda0 org.openjdk.VirtualCall.test-AverageTime.log
0x00007fac4fb7dcdf: hlt
Decoding RuntimeStub - resolve_virtual_call 0x00007fac4fb7dd10
0x00007fac4fb7dda0: push %rbp
0x00007fac4fb7dda1: mov %rsp,%rbp
0x00007fac4fb7dda4: pushfq
This stub basically calls into the runtime, figures out which method we want to call, and then asks the IC to patch the call site to point to the newly resolved address! Since that is a one-time action, no wonder we do not see it as hot code. The IC action line mentions changing the entry to another address, which is, by the way, our actual VtableStub:
$ grep -C 4 0x00007fac4fcb2ab0: org.openjdk.VirtualCall.test-AverageTime.log
Decoding VtableStub vtbl[5]@12
8.94% 6.49% 0x00007fac4fcb2ab0: mov 0x8(%rsi),%eax
0.16% 0.06% 0x00007fac4fcb2ab3: shl $0x3,%rax
0.20% 0.10% 0x00007fac4fcb2ab7: mov 0x1e0(%rax),%rbx
2.34% 1.90% 0x00007fac4fcb2abe: jmpq *0x40(%rbx)
0x00007fac4fcb2ac1: int3
In the end, no runtime/compiler calls are needed to dispatch over a resolved call: the call site just calls the VtableStub that does the VMT dispatch, never leaving the generated machine code. The IC machinery handles virtual monomorphic and static calls in a similar way, pointing to a stub/address that does not do the VMT dispatch.
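The whole dance can be modeled in a few lines of plain Java; again, this is an illustrative toy with made-up names, not the actual Hotspot machinery. The call site initially points at a resolver, which runs once, patches the site to the resolved target, and is never entered again, which is exactly why it does not show up as hot code:

```java
import java.util.function.IntUnaryOperator;

// Toy model of an inline cache (IC): the call site starts out pointing
// at a resolver; the first call resolves the target and patches the call
// site, so all subsequent calls go straight to the resolved code.
public class InlineCacheSketch {
    static int resolveCount = 0;

    // the "call site": a mutable slot holding the current call target
    static IntUnaryOperator callSite = InlineCacheSketch::resolveAndPatch;

    // the "resolved method" the call should eventually reach
    static int target(int x) { return x + 42; }

    // models the resolving runtime stub: find the real target,
    // patch the call site, then complete the original call
    static int resolveAndPatch(int x) {
        resolveCount++;
        callSite = InlineCacheSketch::target; // one-time patch
        return target(x);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) callSite.applyAsInt(i);
        System.out.println("resolver ran " + resolveCount + " time(s)"); // 1
    }
}
```

Five calls, one trip through the resolver: after the patch, the call site dispatches entirely within "generated code".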
What we see in the initial JMH perfasm output is the generated code as it looked right after compilation, but before execution and potential runtime patching.[2]
Observations
Just because the compiler failed to optimize for the best case does not mean the worst case is abysmally worse. True, you will give up some optimizations, but the overhead is not so devastating that you would need to avoid virtual calls altogether. This rhymes with the "Black Magic of (Java) Method Dispatch" conclusion: unless you care very much, you do not have to worry about call performance.