About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, and is not actually reviewed for errors, consistency, writing style, syntactic or semantic mistakes, or duplicates. Use and/or trust this at your own risk.

Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net

Question

It is known that Hotspot does lock coarsening optimizations that can merge several adjacent locking blocks, thus reducing the locking overhead. It effectively converts this:

synchronized (obj) {
  // statements 1
}
synchronized (obj) {
  // statements 2
}

…​into:

synchronized (obj) {
  // statements 1
  // statements 2
}

Now, the interesting question that was posed today is, does Hotspot do this optimization for loops? E.g. having:

for (...) {
  synchronized (obj) {
    // something
  }
}

…​could it optimize into this?

synchronized (obj) {
  for (...) {
     // something
  }
}

Theoretically, nothing prevents us from doing this. One might even see this optimization as loop unswitching on steroids, only for locks. However, the downside is potentially coarsening the lock so much that a particular thread would hog the lock while executing a fat loop.
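
For comparison, this is what regular loop unswitching does: a loop-invariant condition is hoisted out of the loop, and the loop body is specialized for each branch. (A minimal illustrative sketch; debug, slowPath and fastPath are hypothetical names.)

// Before: the loop-invariant condition is re-tested on every iteration.
for (int i = 0; i < n; i++) {
    if (debug) { slowPath(i); } else { fastPath(i); }
}

// After unswitching: the test is hoisted, and each loop is specialized.
if (debug) {
    for (int i = 0; i < n; i++) { slowPath(i); }
} else {
    for (int i = 0; i < n; i++) { fastPath(i); }
}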

Experiment

The easiest way to approach answering this is to find positive evidence that current Hotspot does an optimization like this. Luckily, it is pretty simple with JMH, which is useful not only for building the benchmarks, but also for the most important part of engineering: analyzing them. Let's start with a simple benchmark:

@Fork(..., jvmArgsPrepend = {"-XX:-UseBiasedLocking"})
@State(Scope.Benchmark)
public class LockRoach {
    int x;

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void test() {
        for (int c = 0; c < 1000; c++) {
            synchronized (this) {
                x += 0x42;
            }
        }
    }
}

(full source here)

There are a few important tricks here:

  1. Disabling biased locking with -XX:-UseBiasedLocking avoids longer warmups, because biased locking does not start up immediately, but instead waits 5 seconds into the JVM initialization. (See the BiasedLockingStartupDelay option, and the probe after this list.)

  2. Disabling inlining for the @Benchmark method helps to separate it in the disassembly.

  3. Adding a magic number, 0x42, helps to quickly find the increment in the disassembly.
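
The actual startup delay can be double-checked like this (a quick probe; the exact default value differs across JDK versions, so the output is elided here):

$ java -XX:+PrintFlagsFinal -version | grep BiasedLockingStartupDelay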

Running on i7-4790K, Linux x86_64, JDK 9 EA b156:

Benchmark            Mode  Cnt      Score    Error  Units
LockRoach.test       avgt    5   5331.617 ± 19.051  ns/op

What can you tell from this number? You can't tell anything, right? We need to look into what actually happened down below. -prof perfasm is very useful for this, as it shows the hottest regions in the generated code. Running with default settings would show that the hottest instructions are the lock cmpxchg (compare-and-set) instructions that perform the locking, and would only print the hot things around them. Running with -prof perfasm:mergeMargin=1000 to coalesce those hot regions into a solid picture, one would get this scary-at-first-sight piece of output.
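
For reference, such a run can be requested like this (a sketch; it assumes the benchmark is packaged into the usual JMH target/benchmarks.jar, and that perf is available on the machine):

$ java -jar target/benchmarks.jar LockRoach -prof perfasm:mergeMargin=1000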

Stripping it further down — the cascades of jumps are the locking/unlocking — and paying attention to the code that accumulates the most cycles (first column), we can see that the hottest loop looks like this:

 ↗  0x00007f455cc708c1: lea    0x20(%rsp),%rbx
 │          < blah-blah-blah, monitor enter >     ; <--- coarsened!
 │  0x00007f455cc70918: mov    (%rsp),%r10        ; load $this
 │  0x00007f455cc7091c: mov    0xc(%r10),%r11d    ; load $this.x
 │  0x00007f455cc70920: mov    %r11d,%r10d        ; ...hm...
 │  0x00007f455cc70923: add    $0x42,%r10d        ; ...hmmm...
 │  0x00007f455cc70927: mov    (%rsp),%r8         ; ...hmmmmm!...
 │  0x00007f455cc7092b: mov    %r10d,0xc(%r8)     ; LOL Hotspot, redundant store, killed two lines below
 │  0x00007f455cc7092f: add    $0x108,%r11d       ; add 0x108 = 0x42 * 4 <-- unrolled by 4
 │  0x00007f455cc70936: mov    %r11d,0xc(%r8)     ; store $this.x back
 │          < blah-blah-blah, monitor exit >      ; <--- coarsened!
 │  0x00007f455cc709c6: add    $0x4,%ebp          ; c += 4   <--- unrolled by 4
 │  0x00007f455cc709c9: cmp    $0x3e5,%ebp        ; c < 1000?
 ╰  0x00007f455cc709cf: jl     0x00007f455cc708c1

Huh. The loop seems to be unrolled by 4, and then the locks are coarsened within those 4 iterations! Okay then, if that happens due to loop unrolling, we can quantify the performance benefits of this limited coarsening by trimming down the unrolling with -XX:LoopUnrollLimit=1:

Benchmark            Mode  Cnt      Score    Error  Units

# Default
LockRoach.test       avgt    5   5331.617 ± 19.051  ns/op

# -XX:LoopUnrollLimit=1
LockRoach.test       avgt    5  20679.043 ±  3.133  ns/op

Whoa, a 4x performance hit! That stands to reason, because we have already observed that the hottest things are the lock cmpxchg operations from locking. Naturally, a 4x-coarsened lock means 4x better throughput. Very cool, can we claim success and move on? Not yet, we have to verify that disabling loop unrolling actually gives us what we want to compare against. perfasm seems to indicate it yields a similar hot loop, but with a single stride:

 ↗  0x00007f964d0893d2: lea    0x20(%rsp),%rbx
 │          < blah-blah-blah, monitor enter >
 │  0x00007f964d089429: mov    (%rsp),%r10        ; load $this
 │  0x00007f964d08942d: addl   $0x42,0xc(%r10)    ; $this.x += 0x42
 │          < blah-blah-blah, monitor exit >
 │  0x00007f964d0894be: inc    %ebp               ; c++
 │  0x00007f964d0894c0: cmp    $0x3e8,%ebp        ; c < 1000?
 ╰  0x00007f964d0894c6: jl     0x00007f964d0893d2 ;

Ah, OK, everything checks out.

Observations

While lock coarsening does not work on the entire loop, another loop optimization, loop unrolling, sets the stage for the regular lock coarsening, once the intermediate representation starts to look as if there are N adjacent lock-unlock sequences. This reaps the performance benefits, and helps to limit the scope of coarsening, which avoids over-coarsening over fat loops.
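
In other words, the generated code behaves as if the benchmark loop had been hand-rewritten like this (an illustrative Java sketch of the combined unroll-plus-coarsen effect, not actual compiler output; the handling of residual iterations is elided):

for (int c = 0; c < 1000; c += 4) {
    synchronized (this) {   // one coarsened lock covers 4 unrolled iterations
        x += 0x42;          // the JIT further folds these four increments
        x += 0x42;          //   into a single x += 0x108, as seen in the
        x += 0x42;          //   disassembly above
        x += 0x42;
    }
}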