About, Disclaimers, Contacts
"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, and not actually reviewed for errors, consistency, writing style, syntactic and semantic errors, or duplicates. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net
Question
Surely there are constant values in the program that optimizers can exploit. Does JVM do any tricks there?
Theory
Of course, constant-based optimizations are among the most profitable ones around. Nothing beats not doing the work at run time, when it can be done at compile time. But what is a constant? Plain fields are clearly not constants: they can change all the time. What about final fields? They should stay the same. But, since instance fields are part of the object state, the values of final instance fields also depend on the identity of the object in question:
class M {
final int x;
M(int x) { this.x = x; }
}
M m1 = new M(1337);
M m2 = new M(8080);
int work(M m) {
return m.x; // what to compile in here, 1337 or 8080?
}
Therefore, it stands to reason that if we compile the method work above without knowing anything about the identity of the object coming in as the argument [1], the only thing we can trust is static final fields: they are unchangeable because of final, and we know exactly the identity of the "holding object", because the field is held by the class, not by every individual object.
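To make the contrast concrete, here is a minimal sketch (class and member names are illustrative, not from the post) of what an optimizing compiler can and cannot trust:

```java
class Holder {
    static final int SF = 42; // held by the class itself: a trustworthy constant
    static int S = 42;        // may be reassigned at any time: not trustworthy
    final int f;              // constant per instance, but instance identity is unknown here
    Holder(int f) { this.f = f; }

    static int divSF(int x) { return x / SF; } // compiler may fold this into x / 42
    static int divS(int x)  { return x / S;  } // must reload S and divide every time
    int divF(int x)         { return x / f;  } // must load this.f and divide every time
}
```

All three methods compute the same result; only divSF gives the compiler a value it can bake into the generated code.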
Can we observe this in practice?
Practice
Consider this JMH benchmark:
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class JustInTimeConstants {
static final long x_static_final = Long.getLong("divisor", 1000);
static long x_static = Long.getLong("divisor", 1000);
final long x_inst_final = Long.getLong("divisor", 1000);
long x_inst = Long.getLong("divisor", 1000);
@Benchmark public long _static_final() { return 1000 / x_static_final; }
@Benchmark public long _static() { return 1000 / x_static; }
@Benchmark public long _inst_final() { return 1000 / x_inst_final; }
@Benchmark public long _inst() { return 1000 / x_inst; }
}
It is carefully constructed so that compilers can use the fact that the divisor is constant and optimize the division away. If we run this test, this is what we shall see:
Benchmark Mode Cnt Score Error Units
JustInTimeConstants._inst avgt 15 9.670 ± 0.014 ns/op
JustInTimeConstants._inst_final avgt 15 9.690 ± 0.036 ns/op
JustInTimeConstants._static avgt 15 9.705 ± 0.015 ns/op
JustInTimeConstants._static_final avgt 15 1.899 ± 0.001 ns/op
Briefly studying the hottest loop in this benchmark with -prof perfasm reveals a few implementation details and the reason why some tests are faster. _inst and _inst_final are not surprising: they read the field and use it as the divisor. The bulk of the cycles is spent doing the actual integer division:
# JustInTimeConstants._inst / _inst_final hottest loop
0.21% ↗ mov 0x40(%rsp),%r10
0.02% │ mov 0x18(%r10),%r10 ; get field x_inst / x_inst_final
│ ...
0.13% │ idiv %r10 ; ldiv
76.59% 95.38% │ mov 0x38(%rsp),%rsi ; prepare and consume the value (JMH infra)
0.40% │ mov %rax,%rdx
0.10% │ callq CONSUME
│ ...
1.51% │ test %r11d,%r11d ; call @Benchmark again
╰ je BACK
_static is a bit more interesting: it reads the static field off the native class mirror, where static fields reside. Since the runtime knows which class we are dealing with (static field accesses are statically resolved!), we inline the constant pointer to the mirror, and access the field by its predefined offset. But, since we do not know the value of the field (indeed, someone could have changed it after the code was generated), we still do the same integer division:
# JustInTimeConstants._static hottest loop
0.04% ↗ movabs $0x7826385f0,%r10 ; native mirror for JustInTimeConstants.class
0.02% │ mov 0x70(%r10),%r10 ; get static x_static
│ ...
0.02% │ idiv %r10 ;*ldiv
72.78% 95.51% │ mov 0x38(%rsp),%rsi ; prepare and consume the value (JMH infra)
0.38% │ mov %rax,%rdx
0.04% 0.06% │ data16 xchg %ax,%ax
0.02% │ callq CONSUME
│ ...
0.13% │ test %r11d,%r11d ; call @Benchmark again
╰ je BACK
_static_final is the most interesting of them all. The JIT compiler knows exactly the value it is dealing with, and so it can aggressively optimize for it. Here, the loop computation just reuses the stack slot that holds the precomputed value of "1000 / 1000", which is "1" [2]:
# JustInTimeConstants._static_final hottest loop
1.36% 1.40% ↗ mov %r8,(%rsp)
7.73% 7.40% │ mov 0x8(%rsp),%rdx ; <--- slot holding the "long" constant "1"
0.45% 0.51% │ mov 0x38(%rsp),%rsi ; prepare and consume the value (JMH infra)
3.59% 3.24% │ nop
1.44% 0.54% │ callq CONSUME
│ ...
3.46% 2.37% │ test %r10d,%r10d ; call @Benchmark again
╰ je BACK
So the performance is explained by the compiler's ability to constant fold through static final.
Observations
Note that in this example, the bytecode compiler (e.g. javac) has no idea what the value of the static final field is, because that field is initialized with a runtime value. By the time JIT compilation happens, the class has completed initialization, the value is there, and it can be used! This is really a just-in-time constant. This allows developing very efficient, yet runtime-adjustable code: indeed, the whole thing was thought up as the replacement for preprocessor-based asserts.[3] I frequently miss this kind of trick in C++ land, where compilation is fully ahead-of-time, and thus you have to be creative if you want critical code to depend on runtime options.[4]
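A common shape of that trick is a debug flag read from a system property. The sketch below uses an illustrative class name and property key; the point is that when the flag is off, the optimizing JIT sees a static final false and can drop the whole branch from the generated code:

```java
// Sketch of a runtime-adjustable, zero-overhead flag (names are hypothetical).
class Debug {
    // Read once during class initialization; a just-in-time constant afterwards.
    static final boolean ENABLED = Boolean.getBoolean("debug");

    static int compute(int x) {
        if (ENABLED) {                      // constant-folds to false without -Ddebug=true:
            System.out.println("x = " + x); // ...so the JIT eliminates this branch entirely
        }
        return x * 2;
    }
}
```

Run with -Ddebug=true to get the logging; run without it, and the hot compiled code for compute looks as if the branch was never written.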
A significant part of the story is the interpreter / tiered compilation. Class initializers are usually cold code, because they are executed only once. But the more important thing is handling the lazy part of class initialization, when we want to load and initialize a class the very first time one of its fields is accessed. The interpreter or a baseline JIT compiler (e.g. C1 in HotSpot) runs that for us. By the time the optimizing JIT compiler (e.g. C2 in HotSpot) runs for the same method, the classes the recompiled method needs are usually fully initialized, and their static final fields are fully known.