About, Disclaimers, Contacts

"JVM Anatomy Quarks" is an on-going mini-post series, where every post describes some elementary piece of knowledge about the JVM. The name underlines the fact that no single post can be taken in isolation: most pieces described here readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic mistakes, duplicates, or completeness. Use and/or trust this at your own risk.

Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net

Question

I have heard that the JVM bails out of compiler optimizations when locks are involved, so if I write synchronized, this is what the JVM has to do! Right?

Theory

Under the current Java Memory Model, unobserved locks are not guaranteed to have any memory effects. Among other things, this means that synchronization on non-shared objects is futile, and thus the runtime does not have to do anything there. It still might, but it is not required to, and this opens up optimization opportunities.

Therefore, if escape analysis figures out that the object is non-escaping, the compiler is free to eliminate synchronization on it. Is that observable in practice?
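As a minimal standalone sketch of such an unobserved lock (this class and its names are illustrative, not part of the benchmark below): the monitor is a brand-new object that no other thread can ever see, so the program behaves identically whether the runtime actually takes the lock or elides it:

```java
public class UnobservedLock {
    static int x;

    static void bump() {
        // The monitor is a fresh object visible only to this thread:
        // no other thread can contend on it or observe its state,
        // so eliding the lock cannot change program behavior.
        synchronized (new Object()) {
            x++;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000; i++) {
            bump();
        }
        System.out.println(x); // prints 1000, locked or not
    }
}
```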

Practice

Consider this simple JMH benchmark. We increment something with and without synchronization on a new object:

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class LockElision {

    int x;

    @Benchmark
    public void baseline() {
        x++;
    }

    @Benchmark
    public void locked() {
        synchronized (new Object()) {
            x++;
        }
    }
}

If we run this test with the -prof perfnorm profiler enabled, this is what we shall see:

Benchmark                                   Mode  Cnt   Score    Error  Units

LockElision.baseline                        avgt   15   0.268 ±  0.001  ns/op
LockElision.baseline:CPI                    avgt    3   0.200 ±  0.009   #/op
LockElision.baseline:L1-dcache-loads        avgt    3   2.035 ±  0.101   #/op
LockElision.baseline:L1-dcache-stores       avgt    3  ≈ 10⁻³            #/op
LockElision.baseline:branches               avgt    3   1.016 ±  0.046   #/op
LockElision.baseline:cycles                 avgt    3   1.017 ±  0.024   #/op
LockElision.baseline:instructions           avgt    3   5.076 ±  0.346   #/op

LockElision.locked                          avgt   15   0.268 ±  0.001  ns/op
LockElision.locked:CPI                      avgt    3   0.200 ±  0.005   #/op
LockElision.locked:L1-dcache-loads          avgt    3   2.024 ±  0.237   #/op
LockElision.locked:L1-dcache-stores         avgt    3  ≈ 10⁻³            #/op
LockElision.locked:branches                 avgt    3   1.014 ±  0.047   #/op
LockElision.locked:cycles                   avgt    3   1.015 ±  0.012   #/op
LockElision.locked:instructions             avgt    3   5.062 ±  0.154   #/op

Whoa, the tests perform exactly the same: the timing is the same, and the numbers of loads, stores, cycles, and instructions are the same. With high probability, this means that the generated code is the same. Indeed it is, and it looks like this:

14.50%   16.97%  ↗  incl   0xc(%r8)              ; increment field
76.82%   76.05%  │  movzbl 0x94(%r9),%r10d       ; JMH infra: do another @Benchmark
 0.83%    0.10%  │  add    $0x1,%rbp
 0.47%    0.78%  │  test   %eax,0x15ec6bba(%rip)
 0.47%    0.36%  │  test   %r10d,%r10d
                 ╰  je     BACK

The lock is completely elided: nothing is left of the allocation, nothing of the synchronization. If we supply the JVM flag -XX:-EliminateLocks, or disable EA with -XX:-DoEscapeAnalysis (which breaks every optimization that depends on EA, including lock elision), then the locked counters balloon up:
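For reference, one way to pass these flags through the JMH command-line runner (assuming the benchmark is packaged into the usual benchmarks.jar uberjar; adjust the path to your setup):

```shell
# Disable lock elision only:
java -jar target/benchmarks.jar LockElision -prof perfnorm \
     -jvmArgs "-XX:-EliminateLocks"

# Or disable escape analysis entirely:
java -jar target/benchmarks.jar LockElision -prof perfnorm \
     -jvmArgs "-XX:-DoEscapeAnalysis"
```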

Benchmark                                   Mode  Cnt   Score    Error  Units

LockElision.baseline                        avgt   15   0.268 ±  0.001  ns/op
LockElision.baseline:CPI                    avgt    3   0.200 ±  0.001   #/op
LockElision.baseline:L1-dcache-loads        avgt    3   2.029 ±  0.082   #/op
LockElision.baseline:L1-dcache-stores       avgt    3   0.001 ±  0.001   #/op
LockElision.baseline:branches               avgt    3   1.016 ±  0.028   #/op
LockElision.baseline:cycles                 avgt    3   1.015 ±  0.014   #/op
LockElision.baseline:instructions           avgt    3   5.078 ±  0.097   #/op

LockElision.locked                          avgt   15  11.590 ±  0.009  ns/op
LockElision.locked:CPI                      avgt    3   0.998 ±  0.208   #/op
LockElision.locked:L1-dcache-loads          avgt    3  11.872 ±  0.686   #/op
LockElision.locked:L1-dcache-stores         avgt    3   5.024 ±  1.019   #/op
LockElision.locked:branches                 avgt    3   9.027 ±  1.840   #/op
LockElision.locked:cycles                   avgt    3  44.236 ±  3.364   #/op
LockElision.locked:instructions             avgt    3  44.307 ±  9.954   #/op

…and show the cost of allocation and of trivial synchronization.

Observations

Lock elision is another optimization enabled by escape analysis, and it removes superfluous synchronization. This is especially profitable when internally synchronized implementations do not escape into the wild: then we can dispense with synchronization completely! This is the Zen of compiler optimizations: if no one ever sees the lock acquired, does it make a sound?
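A textbook case of such an internally synchronized implementation (a sketch, not part of the benchmark above) is StringBuffer, whose public methods are all synchronized. When the buffer stays local to a method, escape analysis can prove it thread-local, and every one of those lock acquisitions becomes a candidate for elision:

```java
public class LocalStringBuffer {
    static String join(String a, String b) {
        // StringBuffer methods are synchronized, but this instance never
        // escapes the method; escape analysis can prove it thread-local,
        // and the JIT may elide all of its lock acquisitions.
        StringBuffer sb = new StringBuffer();
        sb.append(a).append(", ").append(b);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join("hello", "world")); // prints hello, world
    }
}
```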