About, Disclaimers, Contacts
"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, and is not actually reviewed for errors, consistency, writing style, or syntactic and semantic mistakes. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net
Questions
How does JMH avoid dead-code elimination for nano-benchmarks? Is there implicit or explicit compiler support?
Theory
Optimizing compilers are good at optimizing simple stuff. For example, if there is a computation that is not observable by anyone, it can be deemed "dead code" and eliminated.
It is usually a good thing, until you run benchmarks. There, you want the computation, but you don’t need the result. In essence, you observe the "resources" taken by the benchmark, but there is no easy way to argue this with a compiler.
So a benchmark like this:
int x, y;

@Benchmark
public void test_dead() {
    int r = x + y; // result is never used, so the computation is dead
}
…would be routinely compiled like this:[1]
1.72% ↗ ...370: movzbl 0x94(%r9),%r10d ; load $isDone
2.06% │ ...378: mov 0x348(%r15),%r11 ; safepoint poll, part 1
27.91% │ ...37f: add $0x1,%rbp ; ops++;
28.56% │ ...383: test %eax,(%r11) ; safepoint poll, part 2
33.43% │ ...386: test %r10d,%r10d ; are we done? spin back if not.
╰ ...389: je ...370
That is, only the benchmark infrastructure remains, with no actual x + y in sight. That code was dead, and it was eliminated.
Pure Java Blackholes
Since forever, JMH has provided a way to avoid dead-code elimination by accepting the result from the benchmark. Under the hood, it is done by feeding that result into a Blackhole, which can also be used directly in some cases.
In short, the Blackhole has to achieve a single side effect on the incoming argument: pretend it is used. The Blackhole implementation notes describe what the Blackhole implementor has to deal with when trying to cooperate with the compiler. Implementing it efficiently is a fine exercise in near-JVM engineering.
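For flavor, here is a minimal sketch of the XOR-compare trick a pure Java blackhole can use; the class and field names are illustrative, not the actual JMH source:

// A minimal sketch of a pure Java int blackhole; names are illustrative,
// not the actual JMH source. The argument is XOR-compared against two
// volatile fields that always differ, so the branch is never taken, but
// the compiler cannot prove that, and therefore has to keep the argument
// (and the computation feeding it) alive.
class SketchBlackhole {
    volatile int i1 = 1;
    volatile int i2 = 2;

    public void consume(int i) {
        // (i ^ i1) == (i ^ i2) holds only if i1 == i2, which never happens.
        if ((i ^ i1) == (i ^ i2)) {
            throw new IllegalStateException("Can't happen");
        }
    }
}

The two XORs, the compare, and the never-taken branch are exactly what shows up in the Blackhole.consume disassembly below.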
Anyhow, all that mess is hidden from JMH users, so they can just do:
int x, y;

@Benchmark
public int test_return() {
    return x + y; // returned value is fed into a Blackhole by JMH
}
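Returning the value is equivalent to consuming it with an explicit Blackhole parameter, which JMH injects; a sketch, with an illustrative method name:

// Explicit form with org.openjdk.jmh.infra.Blackhole; JMH injects the
// instance. Consuming the value has the same "pretend it is used" effect
// as returning it. (Method name is illustrative.)
@Benchmark
public void test_consume(Blackhole bh) {
    bh.consume(x + y);
}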
If you look at the generated code for test_return, though, you would see that both the computation and the Blackhole code are there:
main loop:
2.09% ↗ ...e32: mov 0x40(%rsp),%r10 ; load $this
7.46% │ ...e37: mov 0x10(%r10),%edx ; load $this.x
0.64% │ ...e3b: add 0xc(%r10),%edx ; add $this.y
2.11% │ ...e3f: mov 0x38(%rsp),%rsi ; call Blackhole.consume
1.74% │ ...e44: data16 xchg %ax,%ax
6.52% │ ...e47: callq ...a80
18.37% │ ...e4c: mov (%rsp),%r10
1.50% │ ...e50: movzbl 0x94(%r10),%r11d ; load $isDone
2.85% │ ...e58: mov 0x348(%r15),%r10 ; safepoint poll, part 1
6.74% │ ...e5f: add $0x1,%rbp ; ops++
0.62% │ ...e63: test %eax,(%r10) ; safepoint poll, part 2
0.66% │ ...e66: test %r11d,%r11d ; are we done? spin back if not.
╰ ...e69: je ...e32
Blackhole.consume:
2.34% ...040: mov %eax,-0x14000(%rsp) ; too
9.14% ...047: push %rbp ; lazy
0.64% ...048: sub $0x20,%rsp ; to
3.38% ...04c: mov %edx,%r11d ; cross-reference
6.66% ...04f: xor 0xb0(%rsi),%r11d ; this
0.68% ...056: mov %edx,%r8d ; with
1.76% ...059: xor 0xb8(%rsi),%r8d ; the
1.62% ...060: cmp %r8d,%r11d ; actual
╭ ...063: je ...078 ; Blackhole
7.22% │ ...065: add $0x20,%rsp ; code
0.35% │ ...069: pop %rbp
2.01% │ ...06a: cmp 0x340(%r15),%rsp
│ ...071: ja ...094
8.53% │ ...077: retq
↘ ...078: mov %rsi,%rbp
Not surprisingly, the Blackhole costs dominate such a tiny benchmark. With -prof perfnorm, we can see how bad it is:
Benchmark Mode Cnt Score Error Units
XplusY.test_return avgt 25 3.288 ± 0.032 ns/op
XplusY.test_return:L1-dcache-loads avgt 5 13.092 ± 0.487 #/op
XplusY.test_return:L1-dcache-stores avgt 5 3.031 ± 0.076 #/op
XplusY.test_return:branches avgt 5 5.031 ± 0.089 #/op
XplusY.test_return:cycles avgt 5 8.781 ± 0.351 #/op
XplusY.test_return:instructions avgt 5 27.162 ± 0.489 #/op
That is, our "payload" is only 2 instructions, yet the whole benchmark takes another 25 instructions on top of them! Yes, modern CPUs can execute that whole bunch of instructions in about 9 cycles here, but it is still too much work. To add insult to injury, the calling code and related stack management introduced stores.
The benchmark itself takes about 3.3 ns/op, which puts a lower limit on the effects we can reliably measure.
Compiler Blackholes
Luckily, we can ask for more direct cooperation from the compiler, with the use of compiler blackholes. Those are implemented in OpenJDK 17 with JDK-8259316, with the plan to backport them to 11u as well. Compiler blackholes instruct the compiler to carry all arguments through the optimization phases, and then finally drop them when emitting the generated code. Then, as long as the hardware itself does not surprise us, we should be good.[2]
They are supposed to work transparently for JMH users, but since the whole thing is experimental, at this time JMH users are required to opt in to compiler blackholes with -Djmh.blackhole.mode=COMPILER, and then check the generated code for correctness.[3]
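As a sketch, the opt-in can also be done programmatically, under the assumption that the host-side JMH Runner reads the property when it constructs the forked JVM (the launcher class name is made up):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunWithCompilerBlackholes {
    public static void main(String[] args) throws RunnerException {
        // Assumption: the host-side Runner picks up jmh.blackhole.mode when
        // setting up forks. The command-line equivalent would be passing
        // -Djmh.blackhole.mode=COMPILER to the java invocation that runs
        // the benchmarks.
        System.setProperty("jmh.blackhole.mode", "COMPILER");
        Options opts = new OptionsBuilder()
                .include("XplusY")
                .build();
        new Runner(opts).run();
    }
}

Indeed, using compiler blackholes with our benchmark, we can see that the computation is still there, and there is no Blackhole call anymore!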
8.95% ↗ ...c00: mov 0x10(%r11),%r10d ; load $this.x
0.36% │ ...c04: add 0xc(%r11),%r10d ; add $this.y
│ ; (AND COMPILER BLACKHOLE IT)
0.94% │ ...c08: movzbl 0x94(%r14),%r8d ; load $isDone
26.76% │ ...c10: mov 0x348(%r15),%r10 ; safepoint poll, part 1
8.42% │ ...c17: add $0x1,%rbp ; ops++
0.43% │ ...c1b: test %eax,(%r10) ; safepoint poll, part 2
46.96% │ ...c1e: test %r8d,%r8d ; are we done? spin back if not.
0.02% ╰ ...c21: je ...c00
You cannot even see the blackhole code anywhere, except in the extended disassembly annotations, but its effect is there: the computation is preserved. -prof perfnorm is also happier:
Benchmark Mode Cnt Score Error Units
XplusY.test_return avgt 25 0.963 ± 0.042 ns/op
XplusY.test_return:L1-dcache-loads avgt 5 5.029 ± 0.170 #/op
XplusY.test_return:L1-dcache-stores avgt 5 0.001 ± 0.002 #/op
XplusY.test_return:branches avgt 5 1.006 ± 0.019 #/op
XplusY.test_return:cycles avgt 5 2.569 ± 0.108 #/op
XplusY.test_return:instructions avgt 5 8.043 ± 0.182 #/op
No stores anymore, and only 6 additional instructions carry the infrastructure. The whole benchmark manages to succeed in less than 3 cycles and less than 1 ns, and that involves 5 L1 loads, 3 of which are infrastructural ones.[4]
This makes explicit Blackhole uses more convenient too, for example when doing loops:
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class CPI_Floor {

    @Param({"1000000"})
    private int count;

    @Benchmark
    public void test(Blackhole bh) {
        for (int c = 0; c < count; c += 10000) {
            for (int k = 0; k < 10000; k++) {
                int v = k + k + k;
                bh.consume(v); // compiler blackhole keeps v alive at near-zero cost
            }
        }
    }
}
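For completeness, here is a sketch of launching this benchmark with the perfnorm profiler from the Java API, equivalent to passing -prof perfnorm on the JMH command line (the launcher class name is made up):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunCPIFloor {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include(CPI_Floor.class.getSimpleName())
                .addProfiler("perfnorm") // Linux perf counters, normalized per op
                .build();
        new Runner(opts).run();
    }
}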
On TR 3970X, this hits the CPI floor of ~0.16 clks/insn, or the IPC ceiling of ~6 insns/clk! In fact, it appears that the whole inner loop over "k" executes in about one cycle per iteration: ~1.02M cycles per op against 1M inner iterations!
Benchmark (count) Mode Score Error Units
CPI_Floor.test 1000000 avgt 273422.337 ± 12722.427 ns/op
CPI_Floor.test:CPI 1000000 avgt 0.169 clks/insn
CPI_Floor.test:IPC 1000000 avgt 5.907 insns/clk
CPI_Floor.test:branches 1000000 avgt 1003135.103 #/op
CPI_Floor.test:cycles 1000000 avgt 1022821.963 #/op
CPI_Floor.test:instructions 1000000 avgt 6042142.469 #/op