About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic correctness, or duplicates. Use and/or trust this at your own risk.


Aleksey Shipilëv, JVM/Performance Geek
Shout out on Twitter: @shipilev; questions, comments, suggestions: aleksey@shipilev.net

Question

The Java specification says that a NullPointerException is thrown when we access the fields of a null object. Does this mean the JVM always has to employ runtime checks for nullity?

Theory

In theory, the (JIT) compiler can know that the object is not null and elide the runtime null check, for example when the object is a known constant:

static class Holder { int x; }
static final Holder H = new Holder();

int m() {
  return H.x; // H is known to be not null at JIT compilation time
}

If that does not work, for example when nullity cannot be inferred automatically, compilers can also employ dataflow analysis to remove successive null checks after the first null check for the object has been done. For example:

int m(Holder h) {
  int x1 = h.x; // null-check here
  int x2 = h.x; // no need to null-check here again
  return x1 + x2;
}

Those optimizations are very useful, but quite boring, and they don’t solve the need for null checks in all other cases.

Fortunately, there is even a smarter way to do this: let the user code access the object without the explicit check! Most of the time, nothing bad is going to happen, as most object accesses do not ever see the null object. But we still need to handle the corner case when the null access does happen. When it does, the JVM can intercept the resulting SIGSEGV ("Signal: Segmentation Fault"), look at the return address for that signal, and figure out where that access was made in the generated code. Once it figures that bit out, it can then know where to dispatch the control to handle this case — in most cases, throwing NullPointerException or branching somewhere.
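To see how such a dispatch can work at all, here is a minimal sketch, assuming Linux/x86_64 and glibc. All names here are made up for illustration, and this is emphatically not the HotSpot code, which would also verify that the faulting PC is a known implicit-null-check site in compiled code before redirecting anywhere:

#define _GNU_SOURCE   // for REG_RIP
#include <signal.h>
#include <stdint.h>
#include <ucontext.h>
#include <unistd.h>

// Stand-in for the VM stub that constructs and throws NullPointerException.
static void throw_npe(void) {
    static const char msg[] = "NullPointerException!\n";
    write(2, msg, sizeof(msg) - 1);
    _exit(1);
}

static void handler(int sig, siginfo_t* info, void* ctx) {
    (void) sig; (void) info;
    // Rewrite the saved program counter: returning from the signal resumes
    // execution at throw_npe, instead of re-executing the faulting load.
    ucontext_t* uc = (ucontext_t*) ctx;
    uc->uc_mcontext.gregs[REG_RIP] = (greg_t) (uintptr_t) throw_npe;
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    int* p = NULL;
    return *p;   // the access goes in without any check; SEGV dispatches it
}

When the null access faults, the kernel invokes the handler, the handler rewrites the saved program counter, and control resumes at the "exception" path rather than at the faulting load.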

This mechanism is known in Hotspot under the name "implicit null checks". It was recently added to LLVM under a similar name, catering to the same use case.

Can we see how it works in practice?

Practice

Consider this cunningly simple JMH benchmark:

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3, jvmArgsAppend = {"-XX:LoopUnrollLimit=1"})
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class ImplicitNP {

    @Param({"false", "true"})
    boolean blowup;

    volatile Holder h;

    int itCnt;

    @Setup
    public void setup() {
        h = null;
        if (blowup && ++itCnt == 3) { // blow it up on the 3rd iteration
            for (int c = 0; c < 10000; c++) {
                try {
                    test();
                } catch (NullPointerException npe) {
                    // swallow
                }
            }
            System.out.print("Boom! ");
        }
        h = new Holder();
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    @Benchmark
    public int test() {
        int sum = 0;
        for (int c = 0; c < 100; c++) {
            sum += h.x;
        }
        return sum;
    }

    static class Holder {
        int x;
    }
}

On the surface, this benchmark is simple: it performs 100 integer additions.

Methodology-wise, this benchmark is cunning in several ways:

  1. It is parametrized by the blowup flag, which exposes a null object to the test() method on the 3rd iteration when blowup = true, and leaves it alone otherwise.

  2. It uses looping in a benchmark-unsafe manner. That is mitigated by asking Hotspot not to unroll the loops with -XX:LoopUnrollLimit=1.

  3. It accesses the same object over and over again. A smart optimizer would be able to hoist the load of h out of the loop, and then optimize aggressively. This is mitigated by declaring h as volatile: unless we are dealing with a God-like-smart optimizer, this is enough to break hoisting.

  4. It uses compiler hints to break inlining for test(). This is not, strictly speaking, needed for this benchmark, but it is a safety measure. The reasoning goes as follows: the test relies on the profiling information for test(), and smarter compilers can use caller-callee profiles to split the profile between the version called from setup() and the one called from the benchmark loop itself.

Out of curiosity, with a recent 8u232,[1] it yields the following result:

Benchmark        (blowup)  Mode  Cnt   Score   Error  Units
ImplicitNP.test     false  avgt   15  40.417 ± 0.030  ns/op
ImplicitNP.test      true  avgt   15  63.187 ± 0.156  ns/op

Absolute numbers do not matter much here; the important bit is that the blowup = false case is significantly faster. If we drill down into why, we would probably start by characterizing both tests with the help of -prof perfnorm, which shows the low-level machine counters:

Benchmark                       (blowup)  Mode  Cnt    Score    Error  Units

ImplicitNP.test                    false  avgt   15   40.484 ±  0.090  ns/op
ImplicitNP.test:L1-dcache-loads    false  avgt    3  206.606 ± 24.336   #/op
ImplicitNP.test:L1-dcache-stores   false  avgt    3    5.861 ±  0.426   #/op
ImplicitNP.test:branches           false  avgt    3  102.972 ± 13.679   #/op
ImplicitNP.test:cycles             false  avgt    3  141.252 ± 22.330   #/op
ImplicitNP.test:instructions       false  avgt    3  521.998 ± 87.292   #/op

ImplicitNP.test                     true  avgt   15   63.254 ±  0.047  ns/op
ImplicitNP.test:L1-dcache-loads     true  avgt    3  206.154 ± 15.231   #/op
ImplicitNP.test:L1-dcache-stores    true  avgt    3    4.971 ±  0.677   #/op
ImplicitNP.test:branches            true  avgt    3  199.993 ± 20.805   #/op ; +100 branches
ImplicitNP.test:cycles              true  avgt    3  221.388 ± 13.126   #/op ;  +80 cycles
ImplicitNP.test:instructions        true  avgt    3  714.439 ± 64.476   #/op ; +190 insns

So, we are hunting about 100 excess branches. Note that the loop runs for 100 iterations, so there must be one excess branch per iteration. We also have about 200 excess instructions, which makes sense, as a "branch" is really a test plus a jcc on x86_64, that is, two instructions per check.

Now that we have that hypothesis, let’s see the actual hot code for both cases, with the help of -prof perfasm. The heavily edited snippets are below.

First, blowup = false case:

           ...
  1.71%  ↗  0x...020: mov    0x10(%rsi),%r11d       ; get field "h"
  9.19%  │  0x...024: add    0xc(%r12,%r11,8),%eax  ; sum += h.x
         │                                          ; implicit exception:
         │                                          ; dispatches to 0x...03e
 59.60%  │  0x...029: inc    %r10d                  ; increment "c" and loop
  0.02%  │  0x...02c: cmp    $0x64,%r10d
         ╰  0x...030: jl     0x...020
  4.57%     0x...032: add    $0x10,%rsp
  3.16%     0x...036: pop    %rbp
  3.37%     0x...037: test   %eax,0x16a18fc3(%rip)
            0x...03d: retq
            0x...03e: mov    $0xfffffff6,%esi
            0x...043: callq  0x00007f8aed0453e0     ; <uncommon trap>
            ...

Here, we can see a very tight loop, and the instruction at 0x…​024 combines the compressed reference decoding of h, the access to h.x, and the implicit null check. We do not pay any additional instructions to check h for nullity.[2]

The "implicit exception: dispatches to 0x…​03e" line is the part of the VM output that says the VM knows a SEGV coming from that instruction means a failed null check. The JVM signal handler would then do its bidding and dispatch control to 0x…​03e, which would then go on to throw the exception.[3]

Of course, if null-s are frequent on that path, going via the signal handler every time is rather slow. For our current case, we could argue that throwing the exception is heavyweight anyway, but that argument runs into two logistical problems. First, even though exceptions are sometimes slow, there is no reason to make them even slower if we can avoid it. Second, we would like to handle user-written null checks with the same machinery, and users would not like their simple if (h == null) { …​ } else { …​ } branches to run dramatically worse depending on the nullity of h. Therefore, we would like to use implicit null checks only when the frequency of actual null-s is very low.

Luckily, the JVM can compile the code knowing the runtime profile. That is, when the JIT compiler decides whether to emit the implicit null check, it can look into the profile and see if the object was ever null. Moreover, even if it does emit the implicit null check, it can recompile the code later, when that optimistic assumption about null frequency gets violated. The blowup = true case specifically violates that assumption by feeding null to our code. As a result, the JVM recompiles the whole thing into:[4]

            ...
 11.36%  ↗  0x...bd1: mov    0x10(%rsi),%r11d       ; get field "h"
 12.81%  │  0x...bd5: test   %r11d,%r11d            ; EXPLICIT NULL CHECK
  0.02% ╭│  0x...bd8: je     0x...bf4
 17.23% ││  0x...bda: add    0xc(%r12,%r11,8),%eax  ; sum += h.x
 25.07% ││  0x...bdf: inc    %r10d                  ; increment "c" and loop
  8.70% ││  0x...be2: cmp    $0x64,%r10d
  0.02% │╰  0x...be6: jl     0x...bd1
  3.31% │   0x...be8: add    $0x10,%rsp
  2.49% │   0x...bec: pop    %rbp
  2.72% │   0x...bed: test   %eax,0x160e640d(%rip)
        │   0x...bf3: retq
        ↘   0x...bf4: movabs $0x7821044f8,%rsi      ; <preallocated NullPointerException>
            0x...bfe: mov    %r12d,0x10(%rsi)       ; WTF
            0x...c02: add    $0x10,%rsp
            0x...c06: pop    %rbp
            0x...c07: jmpq   0x00007f887d1053a0     ; throw_exception
            ...

Bam! There is an explicit null check in the generated code now![5] The implicit null check turned itself into an explicit one, without user intervention.

You can see that in flight when looking into the full benchmark log:

# JMH version: 1.22
# VM version: JDK 1.8.0_232, OpenJDK 64-Bit Server VM, 25.232-b09
# VM options: -XX:LoopUnrollLimit=1
# Warmup: 5 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.openjdk.ImplicitNP.test
# Parameters: (blowup = true)

# Run progress: 50.00% complete, ETA 00:00:30
# Fork: 1 of 3
Warmup Iteration   1: 40.900 ns/op
Warmup Iteration   2: 40.698 ns/op
Warmup Iteration   3: Boom! 63.157 ns/op  // <--- recompilation happened here
Warmup Iteration   4: 63.158 ns/op
Warmup Iteration   5: 63.130 ns/op
Iteration   1: 63.188 ns/op
Iteration   2: 63.208 ns/op
Iteration   3: 63.128 ns/op
Iteration   4: 63.137 ns/op
Iteration   5: 63.143 ns/op

See, everything was fine for the first two iterations; then the third iteration exposed null to the code, and the JVM noticed and recompiled.[6] This gives us a more or less flat performance model for null checks.

Other Trivia: Shenandoah GC

Overall, this is quite a useful technique, and so it is used even outside handling the original Java accesses to the heap. For example, Shenandoah GC's load-reference-barrier needs to check if the object is in the collection set. If it is not, the barrier can take a shortcut, as the current object does not move.
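Before looking at the machine code, here is roughly the shape of that barrier in pseudo-C. This is a hedged sketch: thread_gc_state, HAS_FORWARDED, in_cset, and evacuate_or_forward are illustrative stand-ins, not the actual HotSpot sources, and the slow path is stubbed out:

#include <stdbool.h>
#include <stdint.h>

#define HAS_FORWARDED 0x1                // illustrative flag name
static uint8_t thread_gc_state;          // the byte tested at 0x20(%r15) below

static bool  in_cset(void* obj)             { (void) obj; return false; }  // stub; see the model further below
static void* evacuate_or_forward(void* obj) { return obj; }                // slow path, elided

static void* lrb(void* obj) {
    if (!(thread_gc_state & HAS_FORWARDED)) return obj;  // LRB fast path: GC is not evacuating
    if (!in_cset(obj))                      return obj;  // mid path: object does not move
    return evacuate_or_forward(obj);                     // slow path
}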

In x86_64 code:

................. LRB fastpath............................
     0x...067: testb  $0x1,0x20(%r15)
  ╭  0x...06c: jne    0x...086
..│.............. actual heap access .....................
  │↗ 0x...06e: movl   $0x2a,0xc(%r9)
  ││  ...
..││............. LRB mid path ...........................
..││............. checking in-cset .......................
  ↘│ 0x...086: mov    %r9,%r10
   │ 0x...089: shr    $0x17,%r10           ; %r10 is biased region idx
   │ 0x...08d: movabs $0x7f60d00919f0,%r8  ; %r8 is biased cset bitmap
   │ 0x...097: cmpb   $0x0,(%r8,%r10,1)    ; <--- implicit check for null here!
   ╰ 0x...09c: je     0x...06e
      ...

The "collection set" bit is the property of the region, so there is a global "cset bitmap" that tells which regions are in collection set. To figure out whether the object in in collection set, the code divides the object address by the region size, and then checks against the region bitmap. The caveat here is that heap does not necessarily start at zero address. So, that division does not give you the actual region index. Instead, it gives you the biased region index: something that has the constant offset, depending on the actual heap base. To compensate for it, we can access the cset bitmap itself at its biased offset!

This makes us hit the region bitmap for every legitimate object address, except null, which would access something outside the bitmap. But we know exactly which address null would hit, so we can allocate and commit a zero page there; the check then pretends the answer for null is 0, or "false". And it does so without handling null-s with separate runtime checks, or involving any signal handling machinery.
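Here is a self-contained model of that trick. It is a hedged sketch: the region size matches the shr $0x17 above, but the heap base, the map allocation, and all names are assumed for illustration; in the real VM the zero page at biased index 0 is a separately committed page, not part of one big allocation:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define REGION_SHIFT 23u                       // log2(8 MB region); 0x17 in the asm above
#define HEAP_BASE    ((uintptr_t) 0x700000000) // assumed heap start, not zero
#define NUM_REGIONS  16

int main(void) {
    // The biased map covers indices [0, (HEAP_BASE >> REGION_SHIFT) + NUM_REGIONS).
    // Index 0 plays the role of the committed zero page that null lands on;
    // calloc conveniently zeroes the whole thing for this model.
    size_t bias = HEAP_BASE >> REGION_SHIFT;
    char* cset_biased = calloc(bias + NUM_REGIONS, 1);

    cset_biased[bias + 3] = 1;  // pretend region 3 is in the collection set

    uintptr_t obj_in   = HEAP_BASE + 3 * ((uintptr_t) 1 << REGION_SHIFT) + 0x40;
    uintptr_t obj_out  = HEAP_BASE + 5 * ((uintptr_t) 1 << REGION_SHIFT) + 0x40;
    uintptr_t obj_null = 0;

    // The check itself: one shift, one byte load, no null check anywhere.
    printf("in cset:  %d\n", cset_biased[obj_in   >> REGION_SHIFT]);  // 1
    printf("not in:   %d\n", cset_biased[obj_out  >> REGION_SHIFT]);  // 0
    printf("null:     %d\n", cset_biased[obj_null >> REGION_SHIFT]);  // 0, via the "zero page"

    free(cset_biased);
    return 0;
}

Running this prints 1, 0, 0: the null "object" reads the zeroed byte at index 0 and reports "not in collection set", without ever branching on nullity.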

Conclusion

Virtual memory provides some nifty tricks when dealing with memory accesses. Implicit null checks profitably exploit the fact that most null checks never actually fire, and let the virtual memory subsystem notify us in case they do. Managed runtimes with recompilation give us a way to exploit the profile to make the correct guess about the shape of the check, or even to dynamically reshape the code when the assumption about null-check frequency is violated. In the end, the whole thing becomes more or less invisible to the user, while providing substantial performance benefits.


1. We use 8u — instead of whatever the bleeding edge JDK release these days is — to show this optimization is not very new ;)
2. In more complicated cases, the simplified control flow and free register/flags not used by the explicit null check give the nice code quality improvements.
3. In this code, it actually feeds into a so-called "uncommon trap", a topic we will cover eventually. Briefly, this is a notification to the runtime that some never-taken branch was actually taken, asking the JVM to recompile the method using that fact.
4. While this benchmark shows the dynamic recompilation, it can be shown that the same effect would be achieved if we fed null-s, and thus updated the initial profile, before the benchmark code was executed.
5. That 0x…​bfe: mov %r12d,0x10(%rsi) is a nice low-level WTF.
6. -prof perfasm filters everything that happens during warmup iterations, and this is why we don’t see the disassembly from the previous test.