About

"JVM Anatomy Park" is an on-going mini-post series, where every post is slated to take 5-10 minutes to read. As such, each one goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here are therefore anecdotal, and not actually reviewed for errors, consistency, writing style, or duplication. Use and/or trust this at your own risk.

Aleksey Shipilёv, JVM/Performance Geek, Red Hat
Shout out at Twitter: @shipilev
Questions, comments, suggestions: aleksey@shipilev.net

Questions

All these questions have the same answer.

Theory

Suppose you have a managed runtime like the JVM, and you need to stop the Java threads occasionally to run some runtime code; for example, to do a stop-the-world GC. You can wait for all threads to eventually call into the JVM, for example, to ask for an allocation (usually, a TLAB refill), or to enter some native method (where the transition to native would capture them), or to do something else. But that is not guaranteed to happen! What if a thread is currently running in a busy-loop of some kind, never doing anything special?
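Such a loop is easy to write. The sketch below (a hypothetical illustration, not from the original post) runs pure arithmetic: it never allocates, so it never asks for a TLAB refill, and it never transitions to native code, so without explicit polls the runtime would have no natural opportunity to catch it.

```java
public class BusySpin {
    // A thread running this loop never allocates and never calls native
    // code, so it offers no "natural" stopping point to the runtime.
    // Compiler-emitted safepoint polls on the loop back-edge are the only
    // way to stop it cooperatively.
    static long spin(long n) {
        long acc = 0;
        for (long i = 0; i < n; i++) {
            acc += i;   // pure arithmetic: no allocation, no calls
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(spin(1_000_000)); // prints 499999500000
    }
}
```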

Well, on most machines, stopping a running thread is actually simple: you can send it a signal, force a processor interrupt, etc., to make it stop what it is doing and transfer control somewhere else. However, that is usually not enough to stop a Java thread at an arbitrary point, especially if you want precise garbage collection. There, you want to know what is in the registers and on the stack, in case those values are actually object references you need to deal with. Or, if you want to unbias a lock, you want precise information about the state of the thread and the locks it has acquired. Or, if you deoptimize a method, you really want to do it from a safe location, without losing the already executed part of the code and/or temporary values.

Therefore, modern JVMs, like HotSpot, implement a cooperative scheme: threads ask every so often if they should transfer control to the VM, at known points in their lifetime where their state is known. When all threads stop at those known points, the VM is said to have reached a safepoint. The pieces of code that check for safepoint requests are therefore known as safepoint polls.
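The cooperative scheme can be mimicked in plain Java (a loose analogy under a simplified model: the "poll" is a flag check on the loop back-edge, and the "VM" is just another thread arming that flag):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class CooperativePoll {
    static final AtomicBoolean safepointRequested = new AtomicBoolean();
    static final CountDownLatch reached = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            long work = 0;
            while (true) {
                work++;                          // "useful" work
                if (safepointRequested.get()) {  // the "poll": a known point, known state
                    reached.countDown();         // report: stopped at the known point
                    return;
                }
            }
        });
        worker.start();
        Thread.sleep(10);                        // let the worker run for a bit
        safepointRequested.set(true);            // "VM" arms the poll...
        reached.await();                         // ...and waits for the thread to stop
        System.out.println("worker parked at known point");
    }
}
```

The real thing differs in one crucial way: as the rest of the post shows, the JVM does not pay for a branch on the fast path; the poll compiles to a single load that only "fires" by faulting.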

The implementation needs to satisfy an interesting tradeoff: safepoint polls almost never fire, so they should be very efficient when not triggered. Can we glimpse this in experiments?

Practice

Consider this simple JMH benchmark:

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class EmptyBench {
    @Benchmark
    public void test() {
        // This method is intentionally left blank.
    }
}

You might think this benchmark measures an empty method, but in reality it measures the minimal infrastructure code that services the benchmark: it counts the iterations and waits for the iteration time to be over. Fortunately, that piece of code is rather fast, so it can be dissected in full with the help of -prof perfasm.
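That infrastructure loop looks roughly like this (a simplified sketch of what the JMH harness generates, not the actual generated code; names are illustrative):

```java
public class MeasurementLoopSketch {
    volatile boolean isDone;  // flipped by a harness thread when iteration time is up

    void test() { }           // the benchmark body: inlined and emptied by the JIT

    long run() {
        long operations = 0;
        do {
            test();           // the "measured" call, fully inlined away
            operations++;     // "iterations++" in the assembly below
        } while (!isDone);    // the "isDone" load in the assembly below
        return operations;
    }

    public static void main(String[] args) throws InterruptedException {
        MeasurementLoopSketch s = new MeasurementLoopSketch();
        new Thread(() -> {
            try { Thread.sleep(10); } catch (InterruptedException e) { }
            s.isDone = true;
        }).start();
        System.out.println(s.run() > 0); // prints "true"
    }
}
```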

This is out-of-the-box OpenJDK 8u191:

3.60%  ↗  ...a2: movzbl 0x94(%r8),%r10d       ; load "isDone" field
0.63%  │  ...aa: add    $0x1,%rbp             ; iterations++;
32.82% │  ...ae: test   %eax,0x1765654c(%rip) ; global safepoint poll
58.14% │  ...b4: test   %r10d,%r10d           ; if !isDone, do the cycle again
       ╰  ...b7: je     ...a2

The empty method got inlined, and everything evaporated out of it, only the infrastructure remains.

See that "global safepoint poll"? When a safepoint is needed, the JVM arms the "polling page",[1] so any attempt to read that page triggers a segmentation fault (SEGV). When the SEGV finally fires from this safepoint poll, control passes to the installed SEGV handlers, and the JVM has one ready! See, for example, how JVM_handle_linux_signal does it.

The goal of all these tricks is to make the safepoint polls as cheap as possible, because they need to happen in many places, and they almost never fire. For this reason, the test %eax, (addr) instruction is used: it has no effects when the safepoint poll is not triggered.[2] It also has a very compact encoding, "only" 6 bytes on x86_64. The polling page address is fixed for a given JVM process, so the code generated by the JIT in that process can use RIP-relative addressing: it says the page is at a given offset from the current instruction pointer, saving the need to spend precious bytes encoding an absolute 8-byte address.
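To see what RIP-relative addressing buys, take the poll from the 8u191 listing: the 4-byte displacement 0x1765654c is added to the address of the *next* instruction. The listing truncates the code addresses, so the absolute base below is a made-up assumption; only the displacement and the 6-byte instruction length come from the listing.

```java
public class RipRelative {
    public static void main(String[] args) {
        // Hypothetical absolute address of the poll instruction (the listing
        // only shows the low bits, "...ae"); chosen for illustration.
        long pollInstr   = 0x00007f30_4b1102aeL;
        long instrLength = 6;            // test %eax, disp32(%rip) is 6 bytes
        long disp        = 0x1765654cL;  // displacement from the listing

        // RIP-relative: effective address = address of the next instruction + disp
        long pollingPage = pollInstr + instrLength + disp;
        System.out.printf("polling page at 0x%x%n", pollingPage);

        // The 4-byte displacement stands in for an 8-byte absolute address,
        // which is what keeps the whole encoding at 6 bytes.
    }
}
```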

There is also normally a single polling page that handles all threads at once, so the generated code does not have to disambiguate which thread is currently running. But what if the VM wants to stop individual threads? That is the question answered by JEP 312: "Thread-Local Handshakes". It gives the VM the capability to trigger a handshake poll for an individual thread, which is currently implemented by assigning an individual polling page to each thread, with the poll instruction reading that page's address from thread-local storage.[3][4]

This is out-of-the-box OpenJDK 11.0.1:

0.31%  ↗  ...70: movzbl 0x94(%r9),%r10d   ; load "isDone" field
0.19%  │  ...78: mov    0x108(%r15),%r11  ; reading the thread-local poll page addr
25.62% │  ...7f: add    $0x1,%rbp         ; iterations++;
35.10% │  ...83: test   %eax,(%r11)       ; thread-local handshake poll
34.91% │  ...86: test   %r10d,%r10d       ; if !isDone, do the cycle again
       ╰  ...89: je     ...70

This is purely a runtime consideration, so it can be disabled with -XX:-ThreadLocalHandshakes, and the generated code is then the same as in 8u191. This explains why this benchmark performs differently on 8 and 11 (let us run it under -prof perfnorm right away):

Benchmark                              Mode  Cnt  Score   Error  Units

# 8u191
EmptyBench.test                        avgt   15   0.383 ±  0.007  ns/op
EmptyBench.test:CPI                    avgt    3   0.203 ±  0.014   #/op
EmptyBench.test:L1-dcache-load-misses  avgt    3  ≈ 10⁻⁴            #/op
EmptyBench.test:L1-dcache-loads        avgt    3   2.009 ±  0.291   #/op
EmptyBench.test:cycles                 avgt    3   1.021 ±  0.193   #/op
EmptyBench.test:instructions           avgt    3   5.024 ±  0.229   #/op

# 11.0.1
EmptyBench.test                        avgt   15   0.590 ±  0.023  ns/op ; +0.2 ns
EmptyBench.test:CPI                    avgt    3   0.260 ±  0.173   #/op
EmptyBench.test:L1-dcache-loads        avgt    3   3.015 ±  0.120   #/op ; +1 load
EmptyBench.test:L1-dcache-load-misses  avgt    3  ≈ 10⁻⁴            #/op
EmptyBench.test:cycles                 avgt    3   1.570 ±  0.248   #/op ; +0.5 cycles
EmptyBench.test:instructions           avgt    3   6.032 ±  0.197   #/op ; +1 instruction

# 11.0.1, -XX:-ThreadLocalHandshakes
EmptyBench.test                        avgt   15   0.385 ±  0.007  ns/op
EmptyBench.test:CPI                    avgt    3   0.205 ±  0.027   #/op
EmptyBench.test:L1-dcache-loads        avgt    3   2.012 ±  0.122   #/op
EmptyBench.test:L1-dcache-load-misses  avgt    3  ≈ 10⁻⁴            #/op
EmptyBench.test:cycles                 avgt    3   1.030 ±  0.079   #/op
EmptyBench.test:instructions           avgt    3   5.031 ±  0.299   #/op

So, thread-local handshakes add another L1-hitting load, which costs around half a cycle. This also gives us some grounds to estimate the cost of the safepoint poll itself: it is an L1-hitting load too, and it probably takes another half a cycle.
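The estimate follows from simple arithmetic over the tables above: the 11.0.1 run adds about one load, one instruction, half a cycle, and 0.2 ns per op, and the cycle and time deltas should agree on the (unreported, hence inferred) CPU clock:

```java
public class PollCost {
    public static void main(String[] args) {
        // Scores from the perfnorm tables above
        double ns8  = 0.383, ns11 = 0.590;   // ns/op
        double cy8  = 1.021, cy11 = 1.570;   // cycles/op

        double extraNs     = ns11 - ns8;     // extra time per op, ~0.2 ns
        double extraCycles = cy11 - cy8;     // extra cycles per op, ~0.5 cycles

        // Implied clock frequency: the extra cycles must fit in the extra time
        double ghz = extraCycles / extraNs;  // ~2.65 GHz, a plausible clock
        System.out.printf("+%.3f ns, +%.3f cycles, ~%.2f GHz%n",
                extraNs, extraCycles, ghz);
    }
}
```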

Observations

Safepoint and handshake polls are interesting bits of trivia in managed runtime implementations. They are frequently visible on the hot paths in the generated code, and they sometimes affect performance, especially in tight loops. Yet their existence is necessary for the runtime to implement important features like precise garbage collection, locking optimizations, deoptimization, etc.

There are lots of safepoint-related optimizations which we shall discuss separately.


1. In the Linux/POSIX case, calling mprotect(PROT_NONE) on it is enough.
2. Well, almost. On x86, it changes the flags, but the next instructions would overwrite them anyway; we only need to take care never to emit a safepoint poll between a real test and its associated jCC.
3. Thread-local storage is the piece of native data that is accessible per thread. On many platforms, where register pressure is not very high, the generated code always keeps its address in a register; on x86_64, it is usually %r15.
4. Technically, stopping a subset of threads does not get us to a "safepoint" anymore. But, when thread-local handshakes are enabled, the safepoint can be reached by handshaking with all threads. This covers both "safepoint" and "handshake" cases wholesale.