About, Disclaimers, Contacts
"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, and not actually reviewed for errors, consistency, writing style, syntactic and semantic errors, or duplicates. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net
Question
The Java specification says that a NullPointerException is thrown when we access the fields of a null object. Does this mean the JVM always has to employ runtime checks for nullity?
Theory
In theory, the (JIT) compiler can know that the object is not null and elide the runtime null check, for example when the object is a compile-time constant:
static class Holder { int x; }
static final Holder H = new Holder();
int m() {
return H.x; // H is known to be not null at JIT compilation time
}
If that does not work, for example when the nullity cannot be inferred automatically, compilers can also employ dataflow analysis to remove successive null checks once the first null check for the object has been done. For example:
int m(Holder h) {
int x1 = h.x; // null-check here
int x2 = h.x; // no need to null-check here again
return x1 + x2;
}
Those optimizations are very useful, but quite boring, and they do not eliminate the need for null checks in all other cases.
Fortunately, there is an even smarter way to do this: let the user code access the object without the explicit check! Most of the time, nothing bad is going to happen, as most object accesses never see the null object. But we still need to handle the corner case when the null access does happen. When it does, the JVM can intercept the resulting SIGSEGV ("Signal: Segmentation Fault"), look at the return address for that signal, and figure out where in the generated code that access was made. Once it figures that bit out, it knows where to dispatch control to handle this case: in most cases, throwing a NullPointerException or branching somewhere.
This mechanism is known in Hotspot under the name "implicit null checks". It was recently added to LLVM under a similar name, to cater to the same use case.
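To make the difference concrete, here is a hand-written Java sketch of the two shapes the check can take. This is an illustration of the idea only, not actual JIT output; Holder is the same class as in the snippets above:

// Hand-written illustration of the two check shapes, not actual JIT output.
static int explicitCheck(Holder h) {
    if (h == null) {                 // explicit shape: test + branch on every access
        throw new NullPointerException();
    }
    return h.x;
}

static int implicitCheck(Holder h) {
    return h.x;                      // implicit shape: the raw access doubles as the check;
                                     // a null h faults, and the VM signal handler dispatches
                                     // to the exception-throwing code
}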
Can we see how it works in practice?
Practice
Consider this cunningly simple JMH benchmark:
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3, jvmArgsAppend = {"-XX:LoopUnrollLimit=1"})
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class ImplicitNP {
@Param({"false", "true"})
boolean blowup;
volatile Holder h;
int itCnt;
@Setup
public void setup() {
h = null;
if (blowup && ++itCnt == 3) { // blow it up on the 3rd iteration
for (int c = 0; c < 10000; c++) {
try {
test();
} catch (NullPointerException npe) {
// swallow
}
}
System.out.print("Boom! ");
}
h = new Holder();
}
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
@Benchmark
public int test() {
int sum = 0;
for (int c = 0; c < 100; c++) {
sum += h.x;
}
return sum;
}
static class Holder {
int x;
}
}
On the surface, this benchmark is simple: it performs 100 integer additions per call.
Methodology-wise, this benchmark is cunning in several ways:
- It is parametrized by the blowup flag that exposes the null object to the test() method at the 3rd iteration when blowup = true, and leaves it alone otherwise.
- It uses looping in a benchmark-unsafe manner. That is mitigated by asking Hotspot not to unroll the loops with LoopUnrollLimit.
- It accesses the same object over and over again. A smart optimizer would be able to hoist the load of h outside the loop, and then aggressively optimize (see the sketch after this list). This is mitigated by declaring h as volatile: unless we are dealing with a God-like-smart optimizer, this is enough to break hoisting.
- It uses compiler hints to break inlining for test(). This is not, strictly speaking, needed for this benchmark, but it is a safety measure. The reasoning goes as follows: the test relies on profiling information for test(), and smarter compilers can use caller-callee profiles to split the profile between the version called from setup() and the one called from the benchmark loop itself.
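As promised, here is a hand-written sketch of the hoisting hazard from the third point: if h were a plain (non-volatile) field, an optimizer could legally transform test() along these lines. Illustration only, not actual compiler output:

// What an optimizer could do to test() if "h" were not volatile:
public int testHoisted() {
    Holder lh = h;       // single load and single null check, hoisted out of the loop
    return 100 * lh.x;   // the loop then collapses into one multiplication
}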
Out of curiosity, with a recent 8u232,[1] it yields the following result:
Benchmark (blowup) Mode Cnt Score Error Units
ImplicitNP.test false avgt 15 40.417 ± 0.030 ns/op
ImplicitNP.test true avgt 15 63.187 ± 0.156 ns/op
Absolute numbers do not matter much here; the important bit is that one of the cases is much faster than the other. The blowup = false case is significantly faster. If we drill down into why, we would probably start by characterizing it with the help of -prof perfnorm, which shows the low-level machine counters for both tests:
Benchmark (blowup) Mode Cnt Score Error Units
ImplicitNP.test false avgt 15 40.484 ± 0.090 ns/op
ImplicitNP.test:L1-dcache-loads false avgt 3 206.606 ± 24.336 #/op
ImplicitNP.test:L1-dcache-stores false avgt 3 5.861 ± 0.426 #/op
ImplicitNP.test:branches false avgt 3 102.972 ± 13.679 #/op
ImplicitNP.test:cycles false avgt 3 141.252 ± 22.330 #/op
ImplicitNP.test:instructions false avgt 3 521.998 ± 87.292 #/op
ImplicitNP.test true avgt 15 63.254 ± 0.047 ns/op
ImplicitNP.test:L1-dcache-loads true avgt 3 206.154 ± 15.231 #/op
ImplicitNP.test:L1-dcache-stores true avgt 3 4.971 ± 0.677 #/op
ImplicitNP.test:branches true avgt 3 199.993 ± 20.805 #/op ; +100 branches
ImplicitNP.test:cycles true avgt 3 221.388 ± 13.126 #/op ; +80 cycles
ImplicitNP.test:instructions true avgt 3 714.439 ± 64.476 #/op ; +190 insns
So, we are hunting about a hundred excess branches. Recall that the loop runs 100 iterations, so that is one excess branch per iteration. We also see about 200 excess instructions, which adds up: a "branch" is really a test plus jcc pair on x86_64, that is, two extra instructions per iteration.
Now that we have that hypothesis, let’s see the actual hot code for both cases, with the help of -prof perfasm
. The highly edited snippets are below.
First, the blowup = false case:
...
1.71% ↗ 0x...020: mov 0x10(%rsi),%r11d ; get field "h"
9.19% │ 0x...024: add 0xc(%r12,%r11,8),%eax ; sum += h.x
│ ; implicit exception:
│ ; dispatches to 0x...03e
59.60% │ 0x...029: inc %r10d ; increment "c" and loop
0.02% │ 0x...02c: cmp $0x64,%r10d
╰ 0x...030: jl 0x...d204020
4.57% 0x...032: add $0x10,%rsp
3.16% 0x...036: pop %rbp
3.37% 0x...037: test %eax,0x16a18fc3(%rip)
0x...03d: retq
0x...03e: mov $0xfffffff6,%esi
0x...043: callq 0x00007f8aed0453e0 ; <uncommon trap>
...
Here, we can see a very tight loop, where the single instruction at 0x…024 combines the compressed reference decoding of h, the access to h.x, and the implicit null check. We do not pay any additional instructions to check h for nullity.[2]
The implicit exception: dispatches to 0x…03e line is the part of the VM output that says the VM knows the SEGV coming from that instruction is actually a failed null check. The JVM signal handler would then do its bidding and dispatch control to 0x…03e, which goes on to throw the exception.[3]
Of course, if nulls are frequent on that path, going via the signal handler every time is rather slow. For our current case, we could argue that throwing the exception is heavyweight anyway, so the extra cost hardly matters, but that argument runs into two logistical problems. First, even though exceptions are sometimes slow, there is no reason to make them even slower if we can avoid it. Second, we would like to handle user-written null checks with the same machinery, and users would not like their simple if (h == null) { … } else { … } branches to run dramatically worse depending on the nullity of h. Therefore, we would like to use implicit null checks only when the frequency of actual nulls is very low.
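For illustration, here is a hand-written sketch of such a user-written check; when the profile says h is almost never null here, the JIT may compile the branch with the same implicit machinery, reaching the null arm through the signal handler. A sketch of the idea, not actual compiled output:

// User-level null check that may compile down to a bare memory access:
static int valueOrDefault(Holder h) {
    if (h == null) {
        return -1;       // cold arm: can be reached via the SEGV handler dispatch
    }
    return h.x;          // hot path: the access itself serves as the null check
}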
Luckily, the JVM can compile the code knowing the runtime profile. That is, when the JIT compiler decides whether to emit the implicit null check, it can look into the profile and see if the object was ever null. Moreover, even if it does emit the implicit null check, it can recompile the code later, when that optimistic assumption about null frequency is violated. The blowup = true case specifically violates that assumption by feeding null to our code. As a result, the JVM recompiles the whole thing into: [4]
...
11.36% ↗ 0x...bd1: mov 0x10(%rsi),%r11d ; get field "h"
12.81% │ 0x...bd5: test %r11d,%r11d ; EXPLICIT NULL CHECK
0.02% ╭│ 0x...bd8: je 0x...bf4
17.23% ││ 0x...bda: add 0xc(%r12,%r11,8),%eax ; sum += h.x
25.07% ││ 0x...bdf: inc %r10d ; increment "c" and loop
8.70% ││ 0x...be2: cmp $0x64,%r10d
0.02% │╰ 0x...be6: jl 0x...bd1
3.31% │ 0x...be8: add $0x10,%rsp
2.49% │ 0x...bec: pop %rbp
2.72% │ 0x...bed: test %eax,0x160e640d(%rip)
│ 0x...bf3: retq
↘ 0x...bf4: movabs $0x7821044f8,%rsi ; <preallocated NullPointerException>
0x...bfe: mov %r12d,0x10(%rsi) ; WTF
0x...c02: add $0x10,%rsp
0x...c06: pop %rbp
0x...c07: jmpq 0x00007f887d1053a0 ; throw_exception
...
Bam! There is an explicit null check in the generated code now![5] The implicit null check turned itself into an explicit one, without user intervention.
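In Java-level terms, the recompiled loop now behaves roughly like this hand-written sketch; the preallocated exception stands in for the VM-internal object visible in the disassembly above. Illustration only, not decompiled output:

// Rough Java-level equivalent of the recompiled test() above.
static final NullPointerException PREALLOCATED_NPE = new NullPointerException();

public int testRecompiled() {
    int sum = 0;
    for (int c = 0; c < 100; c++) {
        Holder lh = h;               // volatile load of "h"
        if (lh == null) {            // explicit test + branch, cheap while rarely taken
            throw PREALLOCATED_NPE;
        }
        sum += lh.x;                 // sum += h.x
    }
    return sum;
}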
You can see that in flight when looking into the full benchmark log:
# JMH version: 1.22
# VM version: JDK 1.8.0_232, OpenJDK 64-Bit Server VM, 25.232-b09
# VM options: -XX:LoopUnrollLimit=1
# Warmup: 5 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.openjdk.ImplicitNP.test
# Parameters: (blowup = true)
# Run progress: 50.00% complete, ETA 00:00:30
# Fork: 1 of 3
Warmup Iteration 1: 40.900 ns/op
Warmup Iteration 2: 40.698 ns/op
Warmup Iteration 3: Boom! 63.157 ns/op // <--- recompilation happened here
Warmup Iteration 4: 63.158 ns/op
Warmup Iteration 5: 63.130 ns/op
Iteration 1: 63.188 ns/op
Iteration 2: 63.208 ns/op
Iteration 3: 63.128 ns/op
Iteration 4: 63.137 ns/op
Iteration 5: 63.143 ns/op
See, everything was fine for the first two iterations; then the third iteration exposed null to the code, and the JVM noticed that and recompiled.[6] This gives us a more or less flat performance model for null checks.
Other Trivia: Shenandoah GC
Overall, this is quite a useful technique, and so it is used even outside handling the original Java accesses to the heap. For example, Shenandoah GC's load-reference-barrier needs to check if the object is in the collection set. If it is not, the barrier can take the shortcut, as the current object does not move.
In x86_64 code:
................. LRB fastpath............................
0x...067: testb $0x1,0x20(%r15)
╭ 0x...06c: jne 0x...086
..│.............. actual heap access .....................
│↗ 0x...06e: movl $0x2a,0xc(%r9)
││ ...
..││............. LRB mid path ...........................
..││............. checking in-cset .......................
↘│ 0x...086: mov %r9,%r10
│ 0x...089: shr $0x17,%r10 ; %r10 is biased region idx
│ 0x...08d: movabs $0x7f60d00919f0,%r8 ; %r8 is biased cset bitmap
│ 0x...097: cmpb $0x0,(%r8,%r10,1) ; <--- implicit check for null here!
╰ 0x...09c: je 0x...06e
...
The "collection set" bit is the property of the region, so there is a global "cset bitmap" that tells which regions are in collection set. To figure out whether the object in in collection set, the code divides the object address by the region size, and then checks against the region bitmap. The caveat here is that heap does not necessarily start at zero address. So, that division does not give you the actual region index. Instead, it gives you the biased region index: something that has the constant offset, depending on the actual heap base. To compensate for it, we can access the cset bitmap itself at its biased offset!
This makes us hit the region bitmap for every legitimate object address, except null, which would access something outside the bitmap. But we know exactly which address null would hit, so we can allocate and commit a zeroed page there; then this check can pretend the answer for null is 0, or "false". And it does so without handling nulls with separate runtime checks, or involving any signal handling machinery.
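As a rough illustration of that arithmetic, here is a hand-written Java sketch with made-up constants; the real code operates on raw pointers inside the VM, and the committed zero page is emulated here with ordinary array slots that are never set:

// Hand-written sketch of the biased cset bitmap; constants are made up.
class BiasedCsetBitmap {
    static final int REGION_SHIFT = 23;  // log2(region size); cf. "shr $0x17" above

    final byte[] biased;                 // emulates the pre-biased bitmap pointer

    BiasedCsetBitmap(long heapBase, int numRegions) {
        int bias = (int) (heapBase >>> REGION_SHIFT);
        // Slots [0, bias) emulate the committed zero page: they are never set,
        // so null (address 0) reads 0, that is, "not in collection set".
        this.biased = new byte[bias + numRegions];
    }

    void addRegion(long regionBase) {
        biased[(int) (regionBase >>> REGION_SHIFT)] = 1;
    }

    boolean inCset(long objAddress) {
        // No null check and no bias subtraction: divide by region size
        // and index the biased bitmap directly, like the asm above does.
        return biased[(int) (objAddress >>> REGION_SHIFT)] != 0;
    }
}

With a toy heapBase of 1L << 26 and a handful of regions, inCset(0) returns false without any special-casing of null, which is the whole point of the trick.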
Conclusion
Virtual memory provides some nifty tricks when dealing with memory accesses. Implicit null checks profitably exploit the fact that most null checks never actually fire, and let the virtual memory subsystem notify us in case they do. Managed runtimes with recompilation give us a way to exploit the profile to make the correct guess about the shape of the check, or even to dynamically reshape the code when the assumption about null-check frequency is violated. In the end, the whole thing becomes more or less invisible to the user, while providing substantial performance benefits.