Aleksey Shipilëv, @shipilev, aleksey@shipilev.net

Note
This post is also available in ePUB and mobi.

Preface

Programming languages such as Java expose relatively high-level abstractions that must be converted to lower-level constructs that can be executed by the underlying hardware. This translation is handled by compilers, which must emit instructions that account for the peculiarities and idiosyncrasies of the underlying hardware. Higher-level programmers are usually oblivious to these details, allowing them to focus on the problems of their particular domain. This is a separation of duties: system programmers try to build efficient low-level tools, and application programmers try to leverage them.

This post is one example of what system programmers do under the covers. We will cover the recent update to volatile fence instruction selection in HotSpot, revisit the rationale for it, and generally highlight how one would approach a change like this. As is good tradition, we will take some diversions into benchmarking methodology. As usual, if you still haven’t learned about JMH and/or haven’t looked through the JMH samples, you may want to do that first before reading the rest of this post, for the best experience.

Thanks to Peter Zotov, Dmitry Groshev, Nitsan Wakart, and Jonathan Yu for reviews and corrections!

Theory

Model

The volatile keyword has a special meaning in Java: it denotes the variables whose accesses are synchronization actions. See Java Memory Model Pragmatics for more discussion of the behavioral parts of the spec. If you feel dizzy about volatile behavior, go read that post first.

How would a runtime/hardware implementation handle the spec requirements? The JSR 133 Cookbook for Compiler Writers describes a conservative approach for a sample implementation to take. It introduces the notion of XY barriers, where X and Y are each one of {Load, Store}. An XY barrier means that operations of type X are not reordered with operations of type Y. This is a very handy abstraction when you are handling the program graphs in compilers: barriers are the "fences" which prevent operations of a given type from floating through them in a given direction. Hardware also treats fences this way.

How would a particular implementation handle the volatile semantics with memory fences? Here is how, given volatile int x, one could emit a volatile store:

<other ops>
[StoreStore]
[LoadStore]
x = 1; // volatile store

Notice how the volatile store is now a releasing store: all operations prior to the volatile store are committed before it. That means once you have properly observed the volatile value, you also have a chance to observe the prior stores — we are building up the implementation of happens-before here.

The second part of that is the volatile load:

int t = x; // volatile load
[LoadLoad]
[LoadStore]
<other ops>

See how the load is now an acquiring load, i.e. all subsequent operations wait to execute until the volatile load has completed.
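
To make the release/acquire pairing concrete, here is a minimal sketch (the class and field names are made up for illustration, not taken from any real code): the volatile store releases the earlier plain store, and the volatile load acquires it, so the reader that observes the flag is guaranteed to observe the data.

class Publication {
    int data;                 // plain field
    volatile boolean ready;   // volatile guard

    void writer() {
        data = 42;            // plain store...
        ready = true;         // ...released by the volatile store
    }

    void reader() {
        if (ready) {          // acquiring volatile load
            // If we observed ready == true, the earlier store to data
            // is also visible: happens-before at work.
            assert data == 42;
        }
    }
}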

Now, we also need to guarantee sequential consistency for synchronization actions, or, in other words, guarantee that volatile operations themselves are not reordered.

Notice that, in the example above, if a volatile load follows the volatile store, sequential consistency is not guaranteed, because there is no memory fence that would prevent the volatile load from being performed first.

<other ops>
[StoreStore]
[LoadStore]
x = 1; // volatile store
[StoreLoad] // Case (a): Guard after volatile stores

...

[StoreLoad] // Case (b): Guard before volatile loads
int t = x; // volatile load
[LoadLoad]
[LoadStore]
<other ops>

Since volatile loads clearly dominate in most programs, sane implementations choose case (a), emitting the StoreLoad barrier after each volatile store.
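
To see why that StoreLoad barrier matters, consider a sketch of the classic Dekker-style idiom (illustrative only, not from the original benchmarks). The JMM guarantees sequential consistency for volatile operations, so the outcome r1 == 0 and r2 == 0 is forbidden; on x86, where the other barriers are no-ops, that guarantee rests exactly on the StoreLoad barrier emitted after each volatile store.

class Dekker {
    volatile int x, y;
    int r1, r2;

    void thread1() {
        x = 1;      // volatile store, followed by [StoreLoad]
        r1 = y;     // volatile load
    }

    void thread2() {
        y = 1;      // volatile store, followed by [StoreLoad]
        r2 = x;     // volatile load
    }
    // Without the StoreLoad barriers, both loads could be satisfied
    // before the stores become visible, yielding r1 == 0 and r2 == 0.
}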

Hardware

Compilers treat these barriers as intentions to break the harmful reorderings. When it comes to hardware, it turns out some hardware already guarantees quite a lot. For example, one could open the Intel Software Developer’s Manual and read that in most x86 implementations:

"" * Reads are not reordered with other reads. [Translation: LoadLoad can be no-op] * Writes are not reordered with older reads. [Translation: LoadStore can be no-op] * Writes to memory are not reordered with other writes…​ [Translation: StoreStore can be no-op] ""
— Intel Software Development Manual; Vol 3A; 8.2.1

Therefore, it turns out that, after we are done treating the barriers in the compilers, the only barrier we need to communicate to x86 hardware (= emit in the machine code) is StoreLoad. And this is where it starts to get funny. x86 declares these handy rules:

"" * Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions. * Reads cannot pass earlier LFENCE and MFENCE instructions. * Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions. * LFENCE instructions cannot pass earlier reads. * SFENCE instructions cannot pass earlier writes. * MFENCE instructions cannot pass earlier reads or writes. ""
— Intel Software Development Manual; Vol 3A; 8.2.1

Looking at these, one might think mfence is a good candidate for the StoreLoad barrier. However, there are also I/O instructions, lock-prefixed instructions and other serializing instructions that have a similar effect. Dave Dice observed years ago that using lock addl %(rsp), 0 is a better solution for the StoreLoad barrier. It may come as a surprise that the dedicated mfence instruction is a bad choice, but there is an explanation: mfence orders all reads/writes, including those coming to/from non-ordinary memory[1], orders asynchronous memory errors, etc. When we are dealing with ordinary memory, lightweight locked instructions may be preferable to a full mfence. lock addl-ing against the top of the stack is arguably good since the stack top is: a) distinct for every thread; b) most likely already in the closest-level cache.

Valhalla Use Case

Enter Project Valhalla, and more notably, Paul Sandoz's approach to bring Enhanced Volatiles to Java through the mechanism dubbed "VarHandles". Paul maintains a set of nice benchmarks to drive the development of this feature, and he handed me a benchmark and said: "I don’t understand why this benchmark behaves in this particular way".

We run it in multiple configurations, and indeed the scores are all over the place: tiered compilation seems to have an effect, compiler bitness seems to have an effect, and inlining seems to have an effect. Then we look into the perfasm profile for the slowest execution mode (twice as slow as the best), and see this…

   clks    insns
  1.61%             0x00007f63d8a6455e: mov    %r10,(%rcx)        ; reference store
           0.00%    0x00007f63d8a64561: mov    %rcx,%r10          ; <card mark>
           0.01%    0x00007f63d8a64564: shr    $0x9,%r10
                    0x00007f63d8a64568: movb   $0x0,(%r14,%r10,1)
  1.73%    2.90%    0x00007f63d8a6456d: lock addl $0x0,(%rsp)     ; StoreLoad barrier
 41.00%   93.96%    0x00007f63d8a64572: mov    (%rsp),%rcx        ; reading the spilled value from the stack
 26.39%    1.48%    0x00007f63d8a64576: mov    0x68(%rcx),%r8     ;

Notice anything? I spy with my little eye a vanilla read-after-write data dependency via %(rsp). You would not notice it until hardware counters point it out as the performance bottleneck. The original premise of addl-ing against the stack top backfired! We actually have something usable on the stack, but in the vicinity of the emitted StoreLoad barriers, those usages are in danger of being caught in a needless dependency chain.

This manifests very nicely in a tight loop with plenty of volatile stores dangerously close to stack operations. However, the effect is intermittent, and heavily dependent on the particular (mis)compilation decisions taken in this code. Touching the compiler a little can completely obliterate the effect if the stack usages are padded away from the StoreLoad barrier — this explains the wild performance swings we observed with different compiler options.

StoreLoad Barrier and Stack Usages

Focused Benchmark

This is where nanobenchmarks come in very handy for quantifying the costs. Constructing a test case for such an intricate effect causes some pain, but it is manageable if you understand how specific compilers treat your code. It is akin to putting a tree log into a grinding machine in such a way that it produces a usable antique chair at the other end — not entirely impossible, but quite hard. Here is an example of a very cool and sharp-edged benchmark used to quantify the issue (tracked as JDK-8050147):

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class VolatileBarrierBench {

    volatile int sink;
    Integer src;

    private int c;

    @Param("0")
    private int backoff;

     /*
      The benchmark below is rather fragile:
        - we pass the object to non-inlinable method, and at least on Linux x86_64
          calling convention the argument gets to %(rsp+0), which is important for this test;
        - looping() is non-inlinable to prevent loop unrolling effects;
        - consumeCPU allows for tunable backoffs;
        - unboxing value before entering the consumeCPU allows to split the *load* from %(rsp),
          and the subsequent barrier;

      Use with great care, and mostly for producing profiled assemblies.
     */

    @Setup
    public void setup() throws NoSuchFieldException {
        src = 42;
    }

    @Benchmark
    public void testVolatile() {
        testWith_Volatile(src);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void testWith_Volatile(Integer v1) {
        c = 0;
        do {
            int v = v1;
            Blackhole.consumeCPU(backoff);
            sink = v;
        } while (looping());
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    private boolean looping() {
        return c++ < 100;
    }
}
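
For reference, a benchmark like this could be driven through the JMH Runner API, sweeping the backoff parameter and the thread count. This is only a sketch of how one might run it; the runner class name and the particular parameter values below are placeholders, not the ones used for the runs in this post.

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class VolatileBarrierBenchRunner {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(VolatileBarrierBench.class.getSimpleName())
                .param("backoff", "0", "1", "5", "10")  // sweep the backoff
                .threads(1)                             // also try more threads
                .forks(5)
                .addProfiler("perfasm")                 // to inspect the generated code
                .build();
        new Runner(opt).run();
    }
}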

Since volatile stores are heavy-weight ops, and we also know that the data dependency amortizes in the presence of intervening operations, we want to quantify the costs with different backoffs. In order to test multiple StoreLoad strategies, we try a few options (see JDK-8050149):

  • mfence: mostly to figure out whether the observation from 5 years ago is no longer valid, and whether we should just switch to mfence to evade the effect

  • lock addl %(rsp), 0: as the baseline

  • lock addl %(rsp-8), 0: as the attempt to de-couple the data dependency through %(rsp)

  • lock addl %(rsp-CacheLine), 0: as the attempt to de-couple onto a different cache line

  • lock addl %(rsp-RedZone), 0: as the attempt to quantify the cost of the large step away

Experimental Results

The benchmark should obviously be run with different backoffs, and with different numbers of measurement threads. Also, since we are dealing with a micro-architectural effect here, we have to quantify it on multiple platforms. Let us highlight a few interesting platforms, in alphabetical order:

AMD Opteron (Abudhabi)

There, we can see a few things already:

  • The performance is consistent across the number of worker threads

  • The performance for each particular mode is consistent across the runs

  • One can notice the gap between lock addl %(rsp), 0 and other lock addl-s is rapidly decreasing with backoff; this is because the distance between the stack usages and the lock-prefixed instruction grows with the backoff

  • mfence is indeed much slower than lock-prefixed instructions

This data tells us that any reasonable offset from %rsp is good.

[Chart: AMD Abudhabi.data]

AMD Opteron (Interlagos)

To confirm this, here is another data point from another AMD microarchitecture, which yields the same result and the same conclusion. This additionally verifies that our experiment is internally consistent.

[Chart: AMD Interlagos.data]

Intel Atom (Silverthorne)

When dealing with synchronization, it is advisable to check the impact on lower-end systems as well. On a dual-core Atom processor, there is no difference in any of the cases, because the CPU is not sophisticated enough to aggressively reorder memory instructions.

[Chart: Intel Atom server.data]

Intel Core (Haswell)

And now, the new Haswell microarchitecture from Intel. One can notice a few distinct things vs. the AMD results:

  • The effect of the %(rsp) dependency is much worse, and so we gain a lot by offsetting from %(rsp). The effect also drags on further with higher backoffs.

  • There is an interesting speckle and slightly lower performance with lock addl %(rsp-8), 0, and it is consistent in this experiment.

[Chart: Intel Haswell.data]

Digging into the generated assembly for the green-line speckle yields an interesting discovery: it looks like we call into looping() so fast that we actually collide with a callee save below %rsp! Oh well, small offsets are not as safe as we thought, then.

Fixing The Issue

Since we already have the code which selects different instruction sequences for the StoreLoad barrier, we can just switch to the best candidate and be done with it. The best candidate is lock addl-ing further below %(rsp): it is still distinct for every thread, it is still adjacent to hot memory, and it has enough room to dodge the unlucky memory dependencies.

However, accessing memory below the stack pointer generally risks causing a stack overflow. The VM handles stack overflows by framing the stack arena with inaccessible guard pages, and intercepting accesses to those guard pages to detect that a stack overflow has occurred. That also requires additional logic to actually probe the stack when putting a new activation record on it: touching the stack further down to trigger the guard page fault early. This mechanism has the fancy name of "stack banging", and if we want to do anything below the stack pointer, we need to communicate the additional banging to the runtime.

Therefore, the actual patch that went into the VM code is a little bit more complicated. Of course, that patch was verified on multiple realistic workloads in the course of regular performance testing, and so far it performs well.

Conclusion

Based on the exploration discussed in this article, we can draw the following conclusions:

  • Deep analysis of benchmark results can uncover unexpected things that seem obvious in hindsight.

  • Nanobenchmarks are good for quickly diagnosing and quantifying microarchitectural behavior. In this case, we jumped from a completely puzzling case of performance jitter to a hypothesis, and then to a candidate solution, in a matter of a few hours (of expert time, YMMV).

  • The decisions made when developing runtimes and compilers have to be re-evaluated periodically to determine whether they remain the most suitable. Since we can’t afford to continuously go over all the decisions we made in millions of lines of code, we have to keep the Night Watch up to track whether any evidence against the current approach pops up. Keep an eye open, and may the Force be with you.

  • This particular thing about StoreLoad interaction with stack usages was referenced by Doug Lea and Dave Dice as an example showing that choosing a "safe" temporary location in memory is harder than we thought, and maybe we should try to just emit xchg against memory to do a volatile store. xchg with a memory operand implies the lock prefix on x86, and therefore has the needed effect without requiring the explicit StoreLoad barrier at all (see the sketch after this list). This will require some careful prototyping in current HotSpot, stay tuned.
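
As a rough illustration of that idea (a sketch, not the actual prototype): an atomic exchange such as getAndSet is intrinsified on x86 to xchg with a memory operand, so expressing a store as an exchange and discarding the old value carries the full fence along with it.

import java.util.concurrent.atomic.AtomicInteger;

class XchgStoreSketch {
    final AtomicInteger x = new AtomicInteger();

    void storeWithImplicitFence(int v) {
        x.getAndSet(v);  // the exchange itself has StoreLoad semantics on x86;
                         // no separate lock addl / mfence is required
    }
}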


1. "Ordinary" here means Write-Back (WB) memory, while non-ordinary means Write-Combining (WC) and Uncacheable (UC) memory types, plus any other (asynchronous) way to deal with memory