About, Disclaimers, Contacts
"JVM Anatomy Quarks" is an ongoing mini-post series, where every post describes some elementary piece of knowledge about the JVM. The name underlines the fact that a single post cannot be taken in isolation: most pieces described here readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, and not thoroughly reviewed for errors, consistency, writing style, or grammar. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net
Question
I have heard that the JVM bails out of compiler optimizations when locks are involved, so if I write synchronized, that is what the JVM has to do! Right?
Theory
Under the current Java Memory Model, unobserved locks are not guaranteed to have any memory effects. Among other things, this means that synchronization on non-shared objects is futile, and thus the runtime does not have to do anything there. It still might, but it is not required to, and this opens up optimization opportunities.
Therefore, if escape analysis figures out that an object is non-escaping, the compiler is free to eliminate synchronization on it. Is that observable in practice?
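Before the benchmark, the distinction can be sketched in plain Java. The class and method names below are hypothetical, chosen only to illustrate the two cases the theory describes: a lock object that never escapes its method, versus one reachable by other threads.

```java
public class EscapeExamples {
    static int counter;
    // SHARED escapes: it is stored in a static field, so any thread can lock it.
    static final Object SHARED = new Object();

    // The lock object is freshly allocated and never leaves this method.
    // Escape analysis can prove no other thread can ever observe this monitor,
    // so the JIT may elide both the allocation and the lock/unlock.
    static void elidable() {
        synchronized (new Object()) {
            counter++;
        }
    }

    // The monitor on SHARED is observable by other threads,
    // so real locking has to happen here.
    static void notElidable() {
        synchronized (SHARED) {
            counter++;
        }
    }

    public static void main(String[] args) {
        elidable();
        notElidable();
        System.out.println(counter); // both increments happen either way: prints 2
    }
}
```

Note that elision changes nothing about the program's result, only about the cost: the increment happens in both methods regardless.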
Practice
Consider this simple JMH benchmark. We increment a field with and without synchronization on a new object:
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class LockElision {
int x;
@Benchmark
public void baseline() {
x++;
}
@Benchmark
public void locked() {
synchronized (new Object()) {
x++;
}
}
}
If we run this test with the -prof perfnorm profiler enabled right away, this is what we shall see:
Benchmark Mode Cnt Score Error Units
LockElision.baseline avgt 15 0.268 ± 0.001 ns/op
LockElision.baseline:CPI avgt 3 0.200 ± 0.009 #/op
LockElision.baseline:L1-dcache-loads avgt 3 2.035 ± 0.101 #/op
LockElision.baseline:L1-dcache-stores avgt 3 ≈ 10⁻³ #/op
LockElision.baseline:branches avgt 3 1.016 ± 0.046 #/op
LockElision.baseline:cycles avgt 3 1.017 ± 0.024 #/op
LockElision.baseline:instructions avgt 3 5.076 ± 0.346 #/op
LockElision.locked avgt 15 0.268 ± 0.001 ns/op
LockElision.locked:CPI avgt 3 0.200 ± 0.005 #/op
LockElision.locked:L1-dcache-loads avgt 3 2.024 ± 0.237 #/op
LockElision.locked:L1-dcache-stores avgt 3 ≈ 10⁻³ #/op
LockElision.locked:branches avgt 3 1.014 ± 0.047 #/op
LockElision.locked:cycles avgt 3 1.015 ± 0.012 #/op
LockElision.locked:instructions avgt 3 5.062 ± 0.154 #/op
Whoa, the tests perform exactly the same: the timing is the same, and the numbers of loads, stores, cycles, and instructions are the same. With high probability, this means the generated code is the same. Indeed it is, and it looks like this:
14.50% 16.97% ↗ incl 0xc(%r8) ; increment field
76.82% 76.05% │ movzbl 0x94(%r9),%r10d ; JMH infra: do another @Benchmark
0.83% 0.10% │ add $0x1,%rbp
0.47% 0.78% │ test %eax,0x15ec6bba(%rip)
0.47% 0.36% │ test %r10d,%r10d
╰ je BACK
The lock is completely elided: nothing is left of the allocation, nothing of the synchronization. If we supply the JVM flag -XX:-EliminateLocks, or disable EA with -XX:-DoEscapeAnalysis (which breaks every optimization that depends on EA, including lock elision), then the locked counters balloon up:
Benchmark Mode Cnt Score Error Units
LockElision.baseline avgt 15 0.268 ± 0.001 ns/op
LockElision.baseline:CPI avgt 3 0.200 ± 0.001 #/op
LockElision.baseline:L1-dcache-loads avgt 3 2.029 ± 0.082 #/op
LockElision.baseline:L1-dcache-stores avgt 3 0.001 ± 0.001 #/op
LockElision.baseline:branches avgt 3 1.016 ± 0.028 #/op
LockElision.baseline:cycles avgt 3 1.015 ± 0.014 #/op
LockElision.baseline:instructions avgt 3 5.078 ± 0.097 #/op
LockElision.locked avgt 15 11.590 ± 0.009 ns/op
LockElision.locked:CPI avgt 3 0.998 ± 0.208 #/op
LockElision.locked:L1-dcache-loads avgt 3 11.872 ± 0.686 #/op
LockElision.locked:L1-dcache-stores avgt 3 5.024 ± 1.019 #/op
LockElision.locked:branches avgt 3 9.027 ± 1.840 #/op
LockElision.locked:cycles avgt 3 44.236 ± 3.364 #/op
LockElision.locked:instructions avgt 3 44.307 ± 9.954 #/op
…and now show the cost of the allocation and of trivial (uncontended) synchronization.
Observations
Lock elision is another optimization enabled by escape analysis, and it removes some superfluous synchronization. This is especially profitable when internally synchronized implementations do not escape into the wild: then, we can dispense with synchronization completely! This is the Zen of compiler optimizations: if no one ever sees the synchronized lock, does it make a sound?
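A classic instance of such an internally synchronized implementation is StringBuffer, whose methods are all synchronized. A hedged sketch (the class and method names here are made up for illustration): when the buffer never escapes the method, every one of those monitors is unobserved, so the compiler may elide them all.

```java
public class ElisionCandidate {
    // StringBuffer.append() and toString() are synchronized methods: each call
    // takes the buffer's monitor. Since sb never escapes this method, escape
    // analysis can prove no other thread can ever contend on it, and the JIT
    // may drop the locking entirely, as if this were a plain StringBuilder.
    static String concat(String a, String b) {
        StringBuffer sb = new StringBuffer(); // never leaves this method
        sb.append(a);                          // synchronized, but elidable
        sb.append(b);                          // synchronized, but elidable
        return sb.toString();                  // synchronized, but elidable
    }

    public static void main(String[] args) {
        System.out.println(concat("lock-", "elision")); // prints "lock-elision"
    }
}
```

Again, elision does not change the result, only the cost: the concatenation behaves identically whether or not the monitors are actually taken.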