About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic errors, or duplicates. Use and/or trust this at your own risk.

Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net

Questions

What is the finest unit for JIT compilation? If the JIT decides to compile a method, does it compile everything in it? Should I warm up methods using real data? What tricks do JIT compilers have to optimize their compilation time?

Theory

It is common wisdom that JIT compilers work on methods: once a method is deemed hot, the runtime system asks the JIT compiler to produce an optimized version of it. It follows, naively, that the JIT compiles the entirety of the method and hands it over to the runtime system.

But in fact, a runtime system that supports speculative compilation and deoptimization lets the JIT compile methods under a set of assumptions about their behavior. We have seen this before in Implicit Null Checks. This time, we will look at a more general technique for handling cold code.

Consider this method that is effectively called only with flag = true:

void m(boolean flag) {
  if (flag) {
     // do stuff A
  } else {
     // do stuff B
  }
}

Even if flag is not known from static analysis, a smart JIT compiler can use branch profiling to figure out that branch "B" is never taken, and compile the method to:

void m(boolean flag) {
  if (flag) {
     // do stuff A
  } else {
     // Assume this branch is never taken.
     <trap to runtime system: uncommon branch is taken>
  }
}

Thus, the code in branch B is never actually compiled. This saves compilation time and usually improves code density, by avoiding dealing with code that would never be needed.

Note this is different from code layout based on branch frequency. In this case, when one of the branch frequencies is exactly zero, we can skip compiling its body completely. If and only if that branch is ever taken, the generated code traps to the runtime system, saying that a compilation pre-condition was violated, and the JIT would regenerate the method body under the new conditions, this time compiling the now-not-uncommon branch.
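This compile-trap-recompile cycle can be provoked with a few lines of code. The sketch below is hypothetical (class and method names are made up, not from the post): it warms up a branchy method with only one branch active, then takes the other branch. Running it with -XX:+PrintCompilation should show the method go "made not entrant" shortly after the first m(false) call.

```java
// Hypothetical demo: warm up with flag = true only, so the profile records
// the "else" branch as never taken; then hit it once to trigger the trap.
public class Deopt {
    static int sink;

    static int m(boolean flag) {
        if (flag) {
            return 1;   // hot branch: the only one compiled at first
        } else {
            return -1;  // cold branch: initially replaced by an uncommon trap
        }
    }

    public static void main(String[] args) {
        // Warm up with flag = true only.
        for (int i = 0; i < 100_000; i++) {
            sink += m(true);
        }
        // First call with flag = false hits the uncommon trap, deoptimizes,
        // and eventually forces a recompilation that includes both branches.
        System.out.println(m(false));
    }
}
```

Functionally, the method behaves the same before and after deoptimization; only the compiled code shape changes.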

Can we see this in practice?

Test

Consider this JMH benchmark:

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ColdCodeBench {

    @Param({"onlyA", "onlyB", "swap"})
    String test;

    boolean condition;

    @Setup(Level.Iteration)
    public void setup() {
        switch (test) {
            case "onlyA":
                condition = true;
                break;
            case "onlyB":
                condition = false;
                break;
            case "swap":
                condition = !condition;
                break;
        }
    }

    int v = 1;
    int a = 1;
    int b = 1;

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void test() {
        if (condition) {
            v *= a;
        } else {
            v *= b;
        }
    }
}

In this test, we either take branch A only, take branch B only, or flip-flop between them on every iteration.

The point of this test is to demonstrate the generated code in a simple manner. The performance of all versions would be roughly the same in this trivial test. In reality, cold branches can carry a lot of code, especially after inlining, and the impact on both compilation times and generated code density would be substantial.
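The annotated disassembly can be obtained with JMH's perfasm profiler. This invocation is a sketch under assumptions: the benchmark is packaged into the usual JMH uber-jar (target/benchmarks.jar), and the hsdis disassembler library is installed so perfasm can print assembly for the hottest regions.

```shell
# Hypothetical invocation: select one scenario with -p, and attach the
# perfasm profiler to get hot-region disassembly (requires hsdis).
java -jar target/benchmarks.jar ColdCodeBench -p test=onlyA -prof perfasm
```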

Not surprisingly, and in line with "Frequency-Based Code Layout", we can see that both "onlyA" and "onlyB" tests lay out the hot branch right away. But then a curious thing happens: there is no code for the second branch at all! Instead, there is a call to the so-called "uncommon trap". That call notifies the runtime that we have failed the compilation pre-condition, and this "uncommon" branch is now taken.

# "onlyA"
  9.54%   ...3cc: movzbl 0x18(%rsi),%r10d  ; load and test $condition
  0.21%   ...3d1: test   %r10d,%r10d
        ╭ ...3d4: je     ...3f6
        │                                  ; if true, then...
  0.90% │ ...3d6: mov    0x10(%rsi),%r10d  ; ...load $a...
  7.81% │ ...3da: imul   0xc(%rsi),%r10d   ; ...multiply by $v...
 17.33% │ ...3df: mov    %r10d,0xc(%rsi)   ; ...store to $v...
  8.16% │ ...3e3: add    $0x20,%rsp        ; ...and return.
  0.60% │ ...3e7: pop    %rbp
  0.18% │ ...3e8: cmp    0x340(%r15),%rsp
  0.02% │ ...3ef: ja     ...408
 10.51% │ ...3f5: retq
        │                                  ; if false, then...
        ↘ ...3f6: mov    %rsi,%rbp
          ...3f9: mov    %r10d,(%rsp)
          ...3fd: mov    $0xffffff45,%esi
          ...402: nop
          ...403: callq  <runtime>         ; - (reexecute) o.o.CCB::test@4 (line 73)
                                           ;   {runtime_call UncommonTrapBlob}

# "onlyB"
 10.21%   ...acc: movzbl 0x18(%rsi),%r10d  ; load and test $condition
  0.25%   ...ad1: test   %r10d,%r10d
        ╭ ...ad4: jne    ...af6
        │                                  ; if false, then...
  0.29% │ ...ad6: mov    0x14(%rsi),%r10d  ; ...load $b...
  8.78% │ ...ada: imul   0xc(%rsi),%r10d   ; ...multiply by $v...
 18.87% │ ...adf: mov    %r10d,0xc(%rsi)   ; ...store $v...
  9.74% │ ...ae3: add    $0x20,%rsp        ; ...and return.
  0.24% │ ...ae7: pop    %rbp
  0.27% │ ...ae8: cmp    0x340(%r15),%rsp
        │ ...aef: ja     ...b08
  9.76% │ ...af5: retq
        │                                  ; if true, then...
        ↘ ...af6: mov    %rsi,%rbp
          ...af9: mov    %r10d,(%rsp)
          ...afd: mov    $0xffffff45,%esi
          ...b02: nop
          ...b03: callq  <runtime>         ; - (reexecute) o.o.CCB::test@4 (line 73)
                                           ;   {runtime_call UncommonTrapBlob}

When that "cold" branch is finally taken, then JVM would recompile the method. It would be visible in -XX:+PrintCompilation log like this:

# Warmup Iteration   1: ...

// Profiled version is compiled with C1 (+MDO)
    351  476       3       org.openjdk.ColdCodeBench::test (37 bytes)

// C2 version is installed
    352  477       4       org.openjdk.ColdCodeBench::test (37 bytes)

// Profiled version is declared dead
    352  476       3       org.openjdk.ColdCodeBench::test (37 bytes)   made not entrant

# Warmup Iteration   2: ...

// Deopt! C2 version is declared dead
   1361  477       4       org.openjdk.ColdCodeBench::test (37 bytes)   made not entrant

// Re-profiling version is compiled with C1 (+counters)
   1363  498       2       org.openjdk.ColdCodeBench::test (37 bytes)

// New C2 version is installed
   1364  499       4       org.openjdk.ColdCodeBench::test (37 bytes)

// Re-profiling version is declared dead
   1364  498       2       org.openjdk.ColdCodeBench::test (37 bytes)   made not entrant

The final result is clearly visible in the "swap" case. There, both branches are compiled:

  4.25%    ...f2c: mov    0xc(%rsi),%r11d  ; load $v
  6.23%    ...f30: movzbl 0x18(%rsi),%r10d ; load and test $condition
  0.04%    ...f35: test   %r10d,%r10d
        ╭  ...f38: je     ...f45
        │                                  ; if true, then
  0.02% │  ...f3a: imul   0x10(%rsi),%r11d ; ...multiply by $a...
 13.33% │  ...f3f: mov    %r11d,0xc(%rsi)  ; ...store $v
  3.82% │╭ ...f43: jmp    ...f4e
        ││                                 ; if false, then
  0.02% ↘│ ...f45: imul   0x14(%rsi),%r11d ; ...multiply by $b...
 18.70%  │ ...f4a: mov    %r11d,0xc(%rsi)  ; ...store $v
  6.12%  ↘ ...f4e: add    $0x10,%rsp
           ...f52: pop    %rbp
  0.08%    ...f53: cmp    0x340(%r15),%rsp
           ...f5a: ja     ...f61
 10.81%    ...f60: retq

Conclusion

Advanced JIT compilers can compile only the actually active parts of a method. This simplifies the generated code and reduces JIT compiler overheads. On the other hand, it complicates warmup: to avoid sudden recompilation, you need to warm up with a profile similar to the one you will run later, so that all relevant paths are compiled.
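That warmup advice can be sketched concretely. The snippet below is hypothetical (not from the post): instead of warming up with a single value of the flag, it alternates the flag during warmup, so both branches get profiled as taken and neither compiles down to an uncommon trap.

```java
// Hypothetical warmup sketch: exercise both branches before steady state,
// so the compiled code covers every path the real workload will take.
public class Warmup {
    static int v = 1;

    static void work(boolean flag, int a, int b) {
        if (flag) { v *= a; } else { v *= b; }
    }

    public static void main(String[] args) {
        // Alternate the flag: both branch counters become non-zero,
        // so the JIT has no reason to replace either branch with a trap.
        for (int i = 0; i < 100_000; i++) {
            work((i & 1) == 0, 1, 1);
        }
        System.out.println(v);
    }
}
```

The cost is that warmup no longer matches the steady-state profile exactly; whether that trade-off is worth it depends on how expensive a mid-run deoptimization would be.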