About

"JVM Anatomy Park" is the on-going mini-post series, where every post is slated to take 5-10 minutes to read. As such, it goes deep for only a single topic, a single test, a single benchmark, a single observation. So, the evidence and discussion here are anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic errors, duplicates, or consistency. Use and/or trust this at your own risk.

Aleksey Shipilёv, Performance Geek @ Red Hat OpenJDK Team
Shout out at Twitter: @shipilev
Questions, comments, suggestions: aleksey@shipilev.net

Question

Surely there are constant values in the program that optimizers can exploit. Does the JVM do any tricks there?

Theory

Of course, constant-based optimizations are among the most profitable ones around. Nothing beats not doing the work at run time when it can be done at compile time. But what is a constant? It seems that plain fields are not constants: they can change at any time. What about final-s? They should stay the same. But, since instance fields are part of the object state, the values of final instance fields also depend on the identity of the object in question:

class M {
  final int x;
  M(int x) { this.x = x; }
}

M m1 = new M(1337);
M m2 = new M(8080);

int work(M m) {
  return m.x; // what to compile in here, 1337 or 8080?
}

Therefore, it stands to reason that if we compile the method work above without knowing anything about the identity of the object coming as the argument [1], the only thing we can trust is static final fields: they are unchangeable because of final, and we know exactly the identity of the "holding object", because the field is held by the class, not by every individual object.
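To make the distinction concrete, here is a minimal sketch. The class name ConstantIdentity and the "limit" system property are made up for illustration; they are not part of the benchmark. It only shows which values depend on the object identity and which do not; it says nothing about what a particular JIT compiler will actually emit.

```java
public class ConstantIdentity {
    // One value per class: the "holding object" is the class itself.
    // "limit" is a hypothetical property name, read once at class initialization.
    static final int LIMIT = Integer.getInteger("limit", 1337);

    public static final class M {
        final int x;                 // one value per M instance
        public M(int x) { this.x = x; }
    }

    // Without knowing which M arrives, no single value can be baked in here.
    static int work(M m) { return m.x; }

    // The value is the same no matter who calls, and it never changes
    // once the class is initialized.
    static int limit() { return LIMIT; }

    public static void main(String[] args) {
        System.out.println(work(new M(1337)));  // depends on the argument
        System.out.println(work(new M(8080)));  // same code, different answer
        System.out.println(limit());            // one answer for the whole class
    }
}
```

Both calls to work go through the same compiled code but must produce different results, which is why the final on the instance field alone does not help here.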

Can we observe this in practice?

Practice

Consider this JMH benchmark:

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class JustInTimeConstants {

    static final long x_static_final = Long.getLong("divisor", 1000);
    static       long x_static       = Long.getLong("divisor", 1000);
           final long x_inst_final   = Long.getLong("divisor", 1000);
                 long x_inst         = Long.getLong("divisor", 1000);

    @Benchmark public long _static_final() { return 1000 / x_static_final; }
    @Benchmark public long _static()       { return 1000 / x_static;       }
    @Benchmark public long _inst_final()   { return 1000 / x_inst_final;   }
    @Benchmark public long _inst()         { return 1000 / x_inst;         }

}

It is carefully constructed so that compilers can use the fact that the divisor is constant and optimize the division out. If we run this test, this is what we see:

Benchmark                          Mode  Cnt  Score   Error  Units
JustInTimeConstants._inst          avgt   15  9.670 ± 0.014  ns/op
JustInTimeConstants._inst_final    avgt   15  9.690 ± 0.036  ns/op
JustInTimeConstants._static        avgt   15  9.705 ± 0.015  ns/op
JustInTimeConstants._static_final  avgt   15  1.899 ± 0.001  ns/op

Briefly studying the hottest loop in this benchmark with -prof perfasm reveals a few implementation details and the reason why some tests are faster.

_inst and _inst_final are not surprising: they read the field and use it as divisor. The bulk of cycles is spent doing the actual integer division:

# JustInTimeConstants._inst / _inst_final hottest loop
0.21%            ↗  mov    0x40(%rsp),%r10
0.02%            │  mov    0x18(%r10),%r10    ; get field x_inst / x_inst_final
                 |  ...
0.13%            │  idiv   %r10               ; ldiv
76.59%   95.38%  │  mov    0x38(%rsp),%rsi    ; prepare and consume the value (JMH infra)
0.40%            │  mov    %rax,%rdx
0.10%            │  callq  CONSUME
                 |  ...
1.51%            │  test   %r11d,%r11d        ; call @Benchmark again
                 ╰  je     BACK

_static is a bit more interesting: it reads the static field off the native class mirror, where static fields reside. Since the runtime knows what class we are dealing with (static field accesses are statically resolved!), it inlines the constant pointer to the mirror and accesses the field by its predefined offset. But since we don’t know the value of the field (indeed, someone could have changed it after the code was generated), we still do the same integer division:

# JustInTimeConstants._static hottest loop
0.04%            ↗  movabs $0x7826385f0,%r10  ; native mirror for JustInTimeConstants.class
0.02%            │  mov    0x70(%r10),%r10    ; get static x_static
                 |  ...
0.02%            │  idiv   %r10               ;*ldiv
72.78%   95.51%  |  mov    0x38(%rsp),%rsi    ; prepare and consume the value (JMH infra)
0.38%            │  mov    %rax,%rdx
0.04%    0.06%   │  data16 xchg %ax,%ax
         0.02%   │  callq  CONSUME
                 |  ...
0.13%            │  test   %r11d,%r11d        ; call @Benchmark again
                 ╰  je     BACK

_static_final is the most interesting of them all. The JIT compiler knows exactly the value it is dealing with, and so it can aggressively optimize for it. Here, the loop computation just reuses the slot which holds the precomputed value of "1000 / 1000", which is "1" [2]:

# JustInTimeConstants._static_final hottest loop
1.36%    1.40%   ↗  mov    %r8,(%rsp)
7.73%    7.40%   │  mov    0x8(%rsp),%rdx       ; <--- slot holding the "long" constant "1"
0.45%    0.51%   │  mov    0x38(%rsp),%rsi      ; prepare and consume the value (JMH infra)
3.59%    3.24%   │  nop
1.44%    0.54%   │  callq  CONSUME
                 | ...
3.46%    2.37%   │  test   %r10d,%r10d          ; call @Benchmark again
                 ╰  je     BACK

So the performance is explained by the compiler’s ability to constant-fold through static final.

Observations

Note that in this example, the bytecode compiler (e.g. javac) has no idea what the value of the static final field is, because that field is initialized with a runtime value. By the time JIT compilation happens, the class has been successfully initialized, so the value is there and can be used! This is really a just-in-time constant. It allows developing very efficient, yet runtime-adjustable code: indeed, the whole thing was thought up as the replacement for preprocessor-based asserts.[3] I frequently miss this kind of trick in C++ land, where compilation is fully ahead-of-time, and thus you have to be creative if you want to have critical code depend on runtime options.[4]
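The runtime-adjustable code mentioned above could look like the following sketch. The class RuntimeChecks and the "checks" property are assumptions made for illustration, not something taken from the JDK. The flag is read once during class initialization, so an optimizing JIT compiler is free to treat the guarded branch as dead code when the flag is false; whether it actually does so depends on the compiler and on inlining decisions.

```java
public class RuntimeChecks {
    // Hypothetical switch, enabled with -Dchecks=true on the command line.
    // javac cannot know its value; the JIT compiler sees it after class init.
    static final boolean CHECKS = Boolean.getBoolean("checks");

    static long scale(long value, long factor) {
        if (CHECKS) {
            // With CHECKS == false this whole block is a candidate for removal
            // in the optimized code, similar to a disabled assert.
            if (factor <= 0) {
                throw new IllegalArgumentException("factor must be positive: " + factor);
            }
        }
        return value * factor;
    }

    public static void main(String[] args) {
        System.out.println(scale(21, 2)); // prints 42
    }
}
```

Running the same jar with and without -Dchecks=true changes the behavior for bad inputs without recompiling the sources, which is the runtime-adjustable part.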

A significant part of the story is the interpreter / tiered compilation. Class initializers are usually cold code, because they are executed once. But the more important thing is handling the lazy part of class initialization, when we want to load and initialize the class on the very first access to its field. The interpreter or the baseline JIT compiler (e.g. C1 in Hotspot) runs that for us. By the time the optimizing JIT compiler (e.g. C2 in Hotspot) compiles the same method, the classes that the recompiled method needs are usually fully initialized, and their static final-s are fully known.
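The lazy part of class initialization is easy to observe from plain Java. In the sketch below, the names LazyInit, Holder and LOG are invented for this example. The nested class is initialized only when its static final field is first read, which is well before any optimizing compiler would get to the hot method.

```java
import java.util.ArrayList;
import java.util.List;

public class LazyInit {
    // Records the order of events, so we can see when initialization happens.
    static final List<String> LOG = new ArrayList<>();

    static class Holder {
        // Not a compile-time constant: reading it triggers Holder initialization.
        static final long DIVISOR = Long.getLong("divisor", 1000);
        static { LOG.add("Holder initialized"); }
    }

    static long divide() {
        return 1000 / Holder.DIVISOR; // the first call initializes Holder
    }

    public static void main(String[] args) {
        LOG.add("before first access");
        divide();
        LOG.add("after first access");
        // prints [before first access, Holder initialized, after first access]
        System.out.println(LOG);
    }
}
```

The first call to divide pays for the initialization. Later calls find Holder already initialized, so DIVISOR has a known value by the time a hot divide would be recompiled.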


1. This does not preclude flow-based optimizations, like calling inlined work(new M(4242)).
2. Doing the same test with int-s instead of long-s would yield an actual mov $0x1, %edx, but I am too lazy to reformat all the assembly listings for this case.
3. And this had not fully worked, because default inlining heuristics still count the method size by the bytecode length, regardless of how much dead code is there. This gradually changes with e.g. incremental inlining.
4. This almost inevitably devolves into template and/or metaprogramming mess we love to write, but hate to debug.