About, Disclaimers, Contacts
"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, and is not actually reviewed for errors, consistency, writing style, or syntactic and semantic mistakes. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net
Questions
How does JMH avoid dead-code elimination for nano-benchmarks? Is there implicit or explicit compiler support?
Theory
Optimizing compilers are good at optimizing simple stuff. For example, if there is a computation that is not observable by anyone, it can be deemed "dead code" and eliminated.
It is usually a good thing, until you run benchmarks. There, you want the computation, but you don’t need the result. In essence, you observe the "resources" taken by the benchmark, but there is no easy way to argue this with a compiler.
So a benchmark like this:
int x, y;

@Benchmark
public void test_dead() {
    int r = x + y; // result is never used, so the computation is dead
}
…would be routinely compiled like this:[1]
1.72% ↗ ...370: movzbl 0x94(%r9),%r10d ; load $isDone
2.06% │ ...378: mov 0x348(%r15),%r11 ; safepoint poll, part 1
27.91% │ ...37f: add $0x1,%rbp ; ops++;
28.56% │ ...383: test %eax,(%r11) ; safepoint poll, part 2
33.43% │ ...386: test %r10d,%r10d ; are we done? spin back if not.
╰ ...389: je ...370
That is, only the benchmark infrastructure remains, with no actual x + y in sight. That code was dead, and it was eliminated.
Pure Java Blackholes
Since forever, JMH has provided a way to avoid dead-code elimination by accepting the result from the benchmark. Under the hood, it is done by feeding that result into a Blackhole, which can also be used directly in some cases.
In short, the Blackhole has to achieve a single side effect on the incoming argument: pretend it is used. The Blackhole implementation notes describe what the Blackhole implementor has to deal with when trying to cooperate with the compiler. Implementing it efficiently is a fine exercise in near-JVM engineering.
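For flavor, here is a minimal sketch of the XOR-compare trick a pure Java blackhole can use; the class and field names are illustrative, not the actual JMH source:

// A minimal sketch of a pure Java int blackhole; names are illustrative,
// not the actual JMH source. The argument is XOR-compared against two
// volatile fields that always differ, so the branch is never taken, but
// the compiler cannot prove that, and therefore has to keep the argument
// (and the computation feeding it) alive.
class SketchBlackhole {
    volatile int i1 = 1;
    volatile int i2 = 2;

    public void consume(int i) {
        // (i ^ i1) == (i ^ i2) holds only if i1 == i2, which never happens.
        if ((i ^ i1) == (i ^ i2)) {
            throw new IllegalStateException("Can't happen");
        }
    }
}

The two XORs, the compare, and the never-taken branch are exactly what shows up in the Blackhole.consume disassembly below.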
Anyhow, all that mess is hidden from JMH users, so they can just do:
int x, y;

@Benchmark
public int test_return() {
    return x + y; // returned value is fed into a Blackhole by JMH
}
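Returning the value is equivalent to consuming it with an explicit Blackhole parameter, which JMH injects; a sketch, with an illustrative method name:

// Explicit form with org.openjdk.jmh.infra.Blackhole; JMH injects the
// instance. Consuming the value has the same "pretend it is used" effect
// as returning it. (Method name is illustrative.)
@Benchmark
public void test_consume(Blackhole bh) {
    bh.consume(x + y);
}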
If you look at the generated code for test_return, though, you would see that both the computation and the Blackhole code are there:
main loop:
2.09% ↗ ...e32: mov 0x40(%rsp),%r10 ; load $this
7.46% │ ...e37: mov 0x10(%r10),%edx ; load $this.x
0.64% │ ...e3b: add 0xc(%r10),%edx ; add $this.y
2.11% │ ...e3f: mov 0x38(%rsp),%rsi ; call Blackhole.consume
1.74% │ ...e44: data16 xchg %ax,%ax
6.52% │ ...e47: callq ...a80
18.37% │ ...e4c: mov (%rsp),%r10
1.50% │ ...e50: movzbl 0x94(%r10),%r11d ; load $isDone
2.85% │ ...e58: mov 0x348(%r15),%r10 ; safepoint poll, part 1
6.74% │ ...e5f: add $0x1,%rbp ; ops++
0.62% │ ...e63: test %eax,(%r10) ; safepoint poll, part 2
0.66% │ ...e66: test %r11d,%r11d ; are we done? spin back if not.
╰ ...e69: je ...e32
Blackhole.consume:
2.34% ...040: mov %eax,-0x14000(%rsp) ; too
9.14% ...047: push %rbp ; lazy
0.64% ...048: sub $0x20,%rsp ; to
3.38% ...04c: mov %edx,%r11d ; cross-reference
6.66% ...04f: xor 0xb0(%rsi),%r11d ; this
0.68% ...056: mov %edx,%r8d ; with
1.76% ...059: xor 0xb8(%rsi),%r8d ; the
1.62% ...060: cmp %r8d,%r11d ; actual
╭ ...063: je ...078 ; Blackhole
7.22% │ ...065: add $0x20,%rsp ; code
0.35% │ ...069: pop %rbp
2.01% │ ...06a: cmp 0x340(%r15),%rsp
│ ...071: ja ...094
8.53% │ ...077: retq
↘ ...078: mov %rsi,%rbp
Not surprisingly, the Blackhole costs dominate such a tiny benchmark. With -prof perfnorm, we can see how bad it is:
Benchmark Mode Cnt Score Error Units
XplusY.test_return avgt 25 3.288 ± 0.032 ns/op
XplusY.test_return:L1-dcache-loads avgt 5 13.092 ± 0.487 #/op
XplusY.test_return:L1-dcache-stores avgt 5 3.031 ± 0.076 #/op
XplusY.test_return:branches avgt 5 5.031 ± 0.089 #/op
XplusY.test_return:cycles avgt 5 8.781 ± 0.351 #/op
XplusY.test_return:instructions avgt 5 27.162 ± 0.489 #/op
That is, our "payload" is only 2 instructions, yet the whole benchmark takes another 25 instructions on top of them! Yes, modern CPUs can execute that whole bunch of instructions in about 9 cycles here, but it is still too much work. To add insult to injury, the calling code and related stack management introduced stores.
The benchmark itself takes about 3.3 ns/op, which puts a lower limit on the effects we can reliably measure.
Compiler Blackholes
Luckily, we can ask for more direct cooperation from the compiler, with the use of compiler blackholes. Those are implemented in OpenJDK 17 with JDK-8259316, with the plan to backport them to 11u as well. Compiler blackholes instruct the compiler to carry all arguments through the optimization phases, and then finally drop them when emitting the generated code. Then, as long as the hardware itself does not surprise us, we should be good.[2]
They are supposed to work transparently for JMH users, but since the whole thing is experimental, at this time JMH users are required to opt in to compiler blackholes with -Djmh.blackhole.mode=COMPILER, and then check the generated code for correctness.[3]
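As a sketch, the opt-in can also be done programmatically, under the assumption that the host-side JMH Runner reads the property when it constructs the forked JVM (the launcher class name is made up):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunWithCompilerBlackholes {
    public static void main(String[] args) throws RunnerException {
        // Assumption: the host-side Runner picks up jmh.blackhole.mode when
        // setting up forks. The command-line equivalent would be passing
        // -Djmh.blackhole.mode=COMPILER to the java invocation that runs
        // the benchmarks.
        System.setProperty("jmh.blackhole.mode", "COMPILER");
        Options opts = new OptionsBuilder()
                .include("XplusY")
                .build();
        new Runner(opts).run();
    }
}

Indeed, using compiler blackholes with our benchmark, we can see that the computation is still there, and there is no Blackhole call anymore!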
8.95% ↗ ...c00: mov 0x10(%r11),%r10d ; load $this.x
0.36% │ ...c04: add 0xc(%r11),%r10d ; add $this.y
│ ; (AND COMPILER BLACKHOLE IT)
0.94% │ ...c08: movzbl 0x94(%r14),%r8d ; load $isDone
26.76% │ ...c10: mov 0x348(%r15),%r10 ; safepoint poll, part 1
8.42% │ ...c17: add $0x1,%rbp ; ops++
0.43% │ ...c1b: test %eax,(%r10) ; safepoint poll, part 2
46.96% │ ...c1e: test %r8d,%r8d ; are we done? spin back if not.
0.02% ╰ ...c21: je ...c00
You cannot even see the blackhole code anywhere, except in the extended disassembly annotations, but its effect is there: the computation is preserved. -prof perfnorm is also happier:
Benchmark Mode Cnt Score Error Units
XplusY.test_return avgt 25 0.963 ± 0.042 ns/op
XplusY.test_return:L1-dcache-loads avgt 5 5.029 ± 0.170 #/op
XplusY.test_return:L1-dcache-stores avgt 5 0.001 ± 0.002 #/op
XplusY.test_return:branches avgt 5 1.006 ± 0.019 #/op
XplusY.test_return:cycles avgt 5 2.569 ± 0.108 #/op
XplusY.test_return:instructions avgt 5 8.043 ± 0.182 #/op
No stores anymore, and only 6 additional instructions carry the infrastructure. The whole benchmark manages to succeed in less than 3 cycles and less than 1 ns, and that involves 5 L1 loads, 3 of which are infrastructural ones.[4]
This makes explicit Blackhole uses more convenient too, for example when doing loops:
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class CPI_Floor {

    @Param({"1000000"})
    private int count;

    @Benchmark
    public void test(Blackhole bh) {
        for (int c = 0; c < count; c += 10000) {
            for (int k = 0; k < 10000; k++) {
                int v = k + k + k;
                bh.consume(v); // compiler blackhole keeps v alive at near-zero cost
            }
        }
    }
}
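For completeness, here is a sketch of launching this benchmark with the perfnorm profiler from the Java API, equivalent to passing -prof perfnorm on the JMH command line (the launcher class name is made up):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunCPIFloor {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include(CPI_Floor.class.getSimpleName())
                .addProfiler("perfnorm") // Linux perf counters, normalized per op
                .build();
        new Runner(opts).run();
    }
}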
On TR 3970X, this hits the CPI floor of ~0.16 clks/insn, or the IPC ceiling of ~6 insns/clk! In fact, it appears that the whole inner loop over "k" executes in about one cycle per iteration: ~1.02M cycles per op against 1M inner iterations!
Benchmark (count) Mode Score Error Units
CPI_Floor.test 1000000 avgt 273422.337 ± 12722.427 ns/op
CPI_Floor.test:CPI 1000000 avgt 0.169 clks/insn
CPI_Floor.test:IPC 1000000 avgt 5.907 insns/clk
CPI_Floor.test:branches 1000000 avgt 1003135.103 #/op
CPI_Floor.test:cycles 1000000 avgt 1022821.963 #/op
CPI_Floor.test:instructions 1000000 avgt 6042142.469 #/op