Close Encounters of The Java Memory Model Kind

Aleksey Shipilёv, @shipilev, aleksey@shipilev.net

This post is also available in ePUB and mobi.

Thanks to Gleb Smirnov, Vladimir Sitnikov, Alex Blewitt, Pavel Rappo, Doug Lea, Brian Goetz, Rory Hunter, Rafael Winterhalter, Paul Sandoz, Andrey Ershov and others for reviews, edits and helpful suggestions!

1. Introduction

Two years ago I painfully researched and built the JMM Pragmatics talk and transcript, hoping it would highlight the particular dark corners of the Java Memory Model for those who cannot afford to spend years studying the formalisms and deriving actionable insights from them. JMM Pragmatics has helped many people, but there is still lots of confusion about what the Memory Model guarantees, and what it does not.

In this post, we will try to follow up on particular misunderstandings about the Java Memory Model, hopefully with practical examples. The examples use the APIs from the jcstress suite, which are concise enough for a reader, and are runnable.

This is a fairly large piece of writing. In this age of attention-scarce Internet it would normally be presented as a series of posts, one section each. Let’s pretend we did that :) If you run out of time/patience to read it through, bookmark the post, and restart from the section where you left off.

Most examples in this post came from public and private discussions we’ve had over the years. I would not claim they cover all memory model abuses. If you have an interesting case that is not covered here, don’t hesitate to drop me a mail, and we can see if it fits the narrative somewhere. concurrency-interest is the mailing list where you can discuss interesting cases in public.

1.1. Java Concurrency Stress Tests

The jcstress API is really simple, and best demonstrated by this trivial example:

@JCStressTest
@State
@Outcome(id = "1, 2", expect = ACCEPTABLE, desc = "T1 updated, then T2 updated.")
@Outcome(id = "2, 1", expect = ACCEPTABLE, desc = "T2 updated, then T1 updated.")
@Outcome(id = "1, 1", expect = ACCEPTABLE, desc = "Both T1 and T2 updated concurrently.")
class VolatileIncrementAtomicityTest {
  volatile int v;

  @Actor
  void actor1(IntResult2 r) {
    r.r1 = ++v;
  }

  @Actor
  void actor2(IntResult2 r) {
    r.r2 = ++v;
  }
}

The jcstress harness will concurrently execute the methods actor1 and actor2 on instances of VolatileIncrementAtomicityTest. For a given instance, one thread is responsible for executing actor1 exactly once, and another thread is responsible for executing actor2 exactly once. Thus an instance will eventually be visited by both threads, sometimes at the same time. Over the aggregation of many instances and executions, each producing a sample returned in an IntResult2 instance, the test case above explores the atomicity of volatile increment.

If you run this test on just about any hardware, this would be the result:

[OK] net.shipilev.jmm.VolatileIncrementAtomicityTest
(fork: #1, iteration #1, JVM args: [-server])
Observed state   Occurrences   Expectation  Interpretation
          1, 1     1,543,069    ACCEPTABLE  Both T1 and T2 updated concurrently.
          1, 2    29,034,989    ACCEPTABLE  T1 updated, then T2 updated.
          2, 1    26,223,172    ACCEPTABLE  T2 updated, then T1 updated.

Notice how often the 1, 1 case is present — this is when both threads have met on the same volatile field, and non-atomically incremented it.

Even if the 1, 1 case were absent, it would not mean that volatile increment is atomic. Concurrency tests are inherently probabilistic, and can generally show that something fails sometimes, not that something always passes. Or, the absence of evidence is not evidence of absence.^[1]
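To see why 1, 1 is possible at all, it helps to spell out ++v as a separate read and write. The following is a minimal single-threaded sketch that replays one unlucky interleaving by hand. It is not a jcstress test, and the class and method names are made up for illustration.

```java
public class IncrementInterleaving {
    // Stands in for the shared volatile field from the test above.
    static volatile int v;

    // Replays one interleaving of two "++v" operations, step by step.
    // Returns {r1, r2, final value of v}.
    public static int[] replayLostUpdate() {
        v = 0;
        int t1 = v;       // actor1: volatile read, sees 0
        int t2 = v;       // actor2: volatile read, also sees 0
        v = t1 + 1;       // actor1: volatile write of 1
        int r1 = t1 + 1;  // actor1: the value of its ++v expression
        v = t2 + 1;       // actor2: volatile write of 1, overwriting actor1's update
        int r2 = t2 + 1;  // actor2: the value of its ++v expression
        return new int[] { r1, r2, v };
    }

    public static void main(String[] args) {
        int[] r = replayLostUpdate();
        System.out.println(r[0] + ", " + r[1] + " (final v = " + r[2] + ")");
    }
}
```

Each individual read and write here is a perfectly well-behaved volatile access. The update is lost only because nothing makes the read-then-write pair indivisible.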

1.2. Hardware and Runtime Modes

Since concurrency testing is probabilistic, we have three distinct setup problems to deal with:

1. The tests should generally run as fast as they possibly can, gathering more samples. The more samples we have, the more chances there are to detect at least one interesting outcome. While we could just run a single test for hours, it is impractical for thousands of tests. Therefore, the infrastructure has to be very fast. Notice that in the example above we are getting tens of millions of samples for a short-running test. That’s for a 5-second run, which translates to roughly 100 ns per sample, mostly bottlenecked on the actual volatile stores. jcstress handles this by cleverly generating optimized runners for the tests.

In this post, we will use default jcstress modes, because they are sufficient to demonstrate most of the effects. For most examples, it is irrelevant how fast the machine is, although faster machines have more chances to reveal interesting behaviors.

2. Even if we translate the testcase program while keeping the operations in the exact order they are present in the source code, the hardware may still execute it out-of-order. This complicates testing, because the ordering rules differ among processor architectures. A test that passes on one platform may well fail on another, because the first platform provides stronger ordering guarantees.

In this post we will mostly use x86 and POWER as interesting architectures. The actual details about their micro-architectures are not relevant. It is enough to have their multi-core versions, and a decent JVM running on both. OpenJDK is freely available, buildable, and runnable on both platforms.

3. Optimizing compilers/runtimes may run programs in surprising ways. Therefore, general testing requires trying different sets of runtime modes to uncover interesting behaviors. Sometimes the interesting outcomes are produced in a transient mode (e.g. when half of the code runs in interpreted mode, and half in compiled mode), since different parts of the runtime may implement concurrency slightly differently. Sometimes compilers produce a consistent compilation result within the allowed boundaries, i.e. sometimes they are not surprising enough. Using randomized compilation modes helps uncover more interesting behaviors. In HotSpot, we use the -XX:+StressLCM -XX:+StressGCM options, where available, to randomize instruction scheduling.

jcstress routinely runs tens of different configurations. In this post, we will only show the configurations with interesting cases.

1.3. Model Recap

Before we go in, let us take a quick recap of the Java Memory Model. A more detailed explanation can be found in "JMM Pragmatics", so if you have a nagging feeling of misunderstanding while reading this, stop reading this post, re-read the JMM Pragmatics transcript, and then get back here.

The Big Picture: Java Memory Model (JMM) describes the allowed program outcomes by describing what executions are conforming. It does so by introducing actions, and orders over actions, and consistency rules that govern what actions+orders+constraints constitute a valid execution. If a program result can be explained by some valid execution, then this result is allowed under JMM.

There are several basic bits in the formalism:

Program order (PO): defines a total order of actions within each thread. This order provides a connection between the single-threaded execution of the program, and the actions it generates. Executions that have program orders inconsistent with the original programs cannot be used to reason about that particular program’s outcomes.
Synchronization order (SO): defines a total order over synchronization actions (which are explicitly enumerated in the spec: volatile read/writes, synchronized enter/exit, etc). It comes with two important consistency rules:
1. SO consistency: all reads that come later in the synchronization order see the last write to that location. This property disallows racy results over synchronization actions in conforming executions.
2. SO-PO consistency: the order of synchronization actions in SO within a single thread is consistent with PO in that thread. This means a conforming execution should have the order of SO to agree with PO.
SO consistency and SO-PO consistency mean that in all conforming executions synchronization actions appear sequentially consistent.
Synchronizes-with order (SW): a suborder of SO that covers the pairs of synchronization actions that "see" each other. This order serves as the connection bridge between different threads.
Happens-before order (HB): the transitive closure of the union of PO and SW. Unlike SO, HB is a partial order, and only relates some actions, not every pair of actions. Also, unlike SO, HB is able to relate the actions that are not synchronization actions, which allows us to cover all important actions with ordering guarantees. HB comes with the following important consistency rule:
1. HB consistency: every read can see either the latest write in the happens-before order (notice the symmetry with SO here), or any other write that is not ordered by HB (this allows races).
Causality rules: additional verifications on otherwise conforming executions, to rule out causality loops. This is verified by a special process of "committing" the actions from the execution and verifying that no self-justification of actions takes place.
Final field rules: this is tangential to the rest of the model, and describes additional constraints imposed by final fields, e.g. additional happens-before ordering between final field stores and their associated reads.
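As a quick illustration of how PO, SW and HB combine, here is a minimal sketch of the classic message-passing idiom in plain Java, outside of jcstress. The class and field names are hypothetical. The plain write of data comes before the volatile write of ready in program order, the volatile read that sees true synchronizes with that write, and so the subsequent plain read of data is ordered after the write of 42 by happens-before.

```java
public class MessagePassing {
    static int data;                // plain field, not a synchronization action
    static volatile boolean ready;  // volatile accesses are synchronization actions

    public static int run() {
        data = 0;
        ready = false;
        int[] seen = new int[1];

        Thread reader = new Thread(() -> {
            while (!ready) {        // volatile read, spins until it observes true
                Thread.onSpinWait();
            }
            seen[0] = data;         // ordered after "data = 42" by HB
        });
        Thread writer = new Thread(() -> {
            data = 42;              // comes first in program order
            ready = true;           // volatile write, the "release" side
        });

        reader.start();
        writer.start();
        try {
            writer.join();
            reader.join();          // join() makes seen[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Under HB consistency the reader can only observe 42 here. Dropping volatile from ready would remove the SW edge, and with it both the guarantee that the loop ever terminates and the guarantee about the value of data.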

Once again, if you don’t understand what is written in this section, go and re-read the JMM Pragmatics transcript first. We will not stop to discuss the basics of the Java Memory Model here.

2. The General Misunderstandings

2.1. Myth: Machines Do What I Tell Them To Do

The first order of business is the confusion between the language specification, and what hits the real hardware. It is easy, nay comfortable, to read the language rules and think that it is exactly what the machine will do.

However, it is a very misleading way of thinking about the issue. Language specification describes the behavior of the abstract machine executing the program. It is the runtime’s job to emulate the behavior of the abstract machine. The point of contention here is that a compatible runtime is not obliged to compile the program exactly as it is written in the source code.

The actual requirement is much weaker: the runtime is obliged to produce results as if there is a compatible abstract machine execution that backs the results. What the runtime does to do the actual computation is up to the runtime. It’s all smoke and mirrors.

For example, if you write:

int m() {
  int a = 42;
  int b = 13;
  int r = a + b;
  return r;
}

…it would be remarkably odd to require that the language runtime actually does allocate storage for all three local variables, store values there, load them back, add them, etc. This whole method should be optimizable to something like this:

  mov %eax, 55;
  ret

In fact, it is optimizable in most languages, and it is allowed to happen because the observed result of the execution is one of the results of abstract machine execution. As long as programs cannot call runtime’s bluff (= detect something that language specification disallows) the runtime is free to do stuff under cover.

This apparent disconnect between the intent described in the high-level language and what happens in reality is the corner-stone of high-level languages' success. Abstracting away the mess of physical reality allows programmers to concentrate on stuff that matters: correctness, security, and pleasing customers with a pixel-perfect rendering.

When you debug a program, debuggers try to reconstruct the illusions to get e.g. step-by-step debugging, observing local variables values, etc. In Java, debuggers usually observe the state of abstract Java machine, rather than the inner workings of JVM. This sometimes requires deoptimizing the generated code, if the information associated with it is not enough to reconstruct the Java machine state.

Similarly, when JMM says (for example) that there are program actions tied into the synchronization order, it does not mean the actual physical implementation should be emitting those loads and stores to the machine code!

For example, if we have a program:

volatile int x;
void m() {
  x = 1;
  x = 2;
  System.out.println(x);
}

…this program is actually optimizable into:

 mov %eax, 2 # first argument
 call System_out_println

Even though the spec talks about (x = 1), (x = 2), and (r1 = x) as synchronization actions tied into the synchronization order, blah blah blah, it does not mean the actual runtime has to perform all the program operations. Most runtimes would, though, because the analysis of whether someone should be able to observe (x = 1) is generally rather complicated.^[2]

As long as the runtime can keep up the appearance of maintaining abstract machine semantics, it is free to do complex transformations. Lack of empirical evidence that the runtime performs some particular plausible transformation cannot serve as evidence of absence of any similar transformation.

Writing reliable software should be based on the actual language guarantees, not on anecdotal observations of what language runtimes are doing today. Once you start relying on anecdotes, prepare to suffer.

2.2. Myth: JSR 133 Cookbook Is JMM Synopsis

Quite a few folks who get burned by the abstract JMM rules rest their gaze on the JSR 133 Cookbook for Compiler Writers. All those sweet, easy to understand barriers are much easier to grasp than the arcana of the formal model. So, many boldly suggest the Cookbook is a brief description (or even a short equivalent) of the Java Memory Model.

But haven’t you read the Preface there?

…And many have not. While this guide is maintained to remain accurate, it is incomplete about some of these evolving details. … for compilers and JVMs … We cannot guarantee that the interpretations are correct.

The JSR 133 Cookbook is one of the possible, yet conservative, sets of rules to implement the JMM. "One of the possible" means that a conforming implementation does not have to follow the Cookbook, as long as it satisfies the JMM requirements. "Conservative" means it does not go into the intricacies of the model, and instead provides a very simple, yet coarse, implementation. It might be unnecessarily strong for practical use. We can go even deeper in our conservatism, and still arrive at a JMM-conforming implementation: make sure the JVM runs on a single core, or has a Global Interpreter Lock, and then concurrency is trivial.

The Cookbook was written to aid the actual compiler writers to quickly come up with a conforming implementation. Are you a compiler writer looking for implementation guidance? No? Thought so. Move along then. The bad thing that happens after you digest the JSR 133 Cookbook is that you start to believe in…

2.3. Myth: Barriers Are The Sane Mental Model

…while in fact, they are not: they are merely an implementation detail. The easiest example why barriers are not reliable as a mental model, is the following simple test case with two back-to-back synchronized statements:

@JCStressTest
@State
public class SynchronizedBarriers {
  int x, y;

  @Actor
  void actor() {
    synchronized(this) {
      x = 1;
    }
    synchronized(this) {
      y = 1;
    }
  }

  @Actor
  void observer(IntResult2 r) {
    // Caveat: get_this_in_order()-s happen in program order
    r.r1 = get_this_in_order(y);
    r.r2 = get_this_in_order(x);
  }
}

Naively, you may think the 1, 0 case is prohibited, because synchronized sections should execute in an order consistent with the program order.

Of course, without keeping reads in order, the result 1, 0 is trivially achievable. But this does not make an interesting test case. The actual test is clever about that: it uses the new VarHandles "opaque" access mode, which inhibits these optimizations and exposes the reads to hardware in the same order:^[3]

private static final VarHandle VH_X, VH_Y;

static {
    try {
        VH_X = MethodHandles.lookup().findVarHandle(Test.class,
                  "x", int.class); // (1)
        VH_Y = MethodHandles.lookup().findVarHandle(Test.class,
                  "y", int.class); // (1)
    } catch (Exception e) {
        throw new IllegalStateException(e);
    }
}

@Actor
public void observer(IntResult2 r) {
    r.r1 = (int) VH_Y.getOpaque(this); // (2)
    r.r2 = (int) VH_X.getOpaque(this); // (2)
}

1	Lookup `VarHandle` for the fields
2	Get the associated field value from `this` object, with "opaque" access mode

You may get a similar effect with non-inlined get_this_in_order() methods that would also be opaque to the optimizer today. Coupled with hardware that does not reorder reads, you have the reads satisfied in the program order. You can emit a full barrier between the loads, if you want to be extra safe in the face of weaker hardware, although it will muddy the waters with barrier interactions. The point of this example, however, is to see what is happening on the writer side, assuming that everything happens in order on the reader side. Do not overlook the writer side, chasing the technicalities on the reader side.
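For readers who have not used VarHandles yet, here is a minimal self-contained sketch of the lookup and the opaque reads, runnable without jcstress. The class name OpaqueReads and its methods are hypothetical, and the sketch is single-threaded, so it only shows the API shape and not any concurrency effect.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class OpaqueReads {
    int x, y;

    private static final VarHandle VH_X, VH_Y;

    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            VH_X = lookup.findVarHandle(OpaqueReads.class, "x", int.class);
            VH_Y = lookup.findVarHandle(OpaqueReads.class, "y", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public OpaqueReads(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Reads y first, then x, both with "opaque" access mode.
    // Returns {value of y, value of x}.
    public int[] readYThenX() {
        int r1 = (int) VH_Y.getOpaque(this);
        int r2 = (int) VH_X.getOpaque(this);
        return new int[] { r1, r2 };
    }

    public static void main(String[] args) {
        int[] r = new OpaqueReads(1, 2).readYThenX();
        System.out.println(r[0] + ", " + r[1]);
    }
}
```

Opaque mode is used here only to keep the compiler from rearranging or eliding the two reads. It adds no happens-before edges, which is exactly why it suits this experiment.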

Let’s see what barriers tell us about code semantics. In pseudo-code, this will do:

void actor() {
  [LoadStore]  // between monitorenter and normal store
  x = 1;
  [StoreStore] // between normal store and monitorexit
  [StoreLoad]  // between monitorexit  and monitorenter

  [LoadStore]  // between monitorenter and normal store
  y = 1;
  [StoreStore] // between normal store and monitorexit
  [StoreLoad]  // between monitorexit  and monitorenter
}

void observer() {
  // Caveat: get_this_in_order()-s happen in program order
  r.r1 = get_this_in_order(y);
  r.r2 = get_this_in_order(x);
}

Yup, seems fine. x = 1 cannot go past y = 1, because it will meet barriers long before that.

However, the JMM itself allows observing 1, 0, because the reads of x and y are not tied by any ordering constraints, and therefore there exists a plausible execution that justifies observing 1, 0. More formally, in whatever conforming execution you can imagine, the reads of x and y are not synchronization actions, and therefore SO rules do not apply to them. The reads are not tied into HB either, and therefore no HB rules prevent them from reading the racy values. Nor are there any causality loops in observing 1, 0.

Allowing this behavior in the model is intentional for two reasons. First of all, the hardware should be able to perform independent operations in whatever order it wants to maximize performance. Secondly, this enables interesting and important optimizations.

For instance, in the example above, we can coalesce back-to-back locks:

void actor() {
  synchronized(this) {
    x = 1;
  }
  synchronized(this) {
    y = 1;
  }
}

// ... becomes:

void actor() {
  synchronized(this) {
    x = 1;
    y = 1;
  }
}

…which improves performance (because lock acquisition is costly), and allows further optimizations within the synchronized block. Notably, since the writes of x and y are independent, we may allow hardware to execute them in an arbitrary order, or allow optimizers to shift them around.

If you run the example above on an actual JVM and hardware, this is what happens on x86 with JDK 9 "fastdebug" build (needed to gain access to instruction scheduling fuzzing):

[OK] net.shipilev.jmm.LockCoarsening
(fork: #1, iteration #1, JVM args: [-server, -XX:+UnlockDiagnosticVMOptions, -XX:+StressLCM, -XX:+StressGCM])
  Observed state   Occurrences              Expectation  Interpretation
            0, 0    43,558,372               ACCEPTABLE  All other cases are acceptable.
            0, 1        22,512               ACCEPTABLE  All other cases are acceptable.
            1, 0         1,565   ACCEPTABLE_INTERESTING  X and Y are visible in different order
            1, 1     1,372,341               ACCEPTABLE  All other cases are acceptable.

Notice the interesting case, that is our 1, 0. Surprise!

Disabling lock optimizations with -XX:-EliminateLocks trims down the number of occurrences of this interesting case to zero:

[OK] net.shipilev.jmm.LockCoarsening
(fork: #1, iteration #1, JVM args: [-server, -XX:+UnlockDiagnosticVMOptions, -XX:+StressLCM, -XX:+StressGCM, -XX:-EliminateLocks])
  Observed state   Occurrences              Expectation  Interpretation
            0, 0    52,892,632               ACCEPTABLE  All other cases are acceptable.
            0, 1       163,611               ACCEPTABLE  All other cases are acceptable.
            1, 0             0   ACCEPTABLE_INTERESTING  X and Y are visible in different order
            1, 1     1,825,907               ACCEPTABLE  All other cases are acceptable.

On POWER, the interesting case is present even without messing with instruction scheduling, because hardware guarantees are weaker:

      [OK] net.shipilev.jmm.LockCoarsening
    (fork: #1, iteration #1, JVM args: [-server])
  Observed state   Occurrences              Expectation  Interpretation
            0, 0     7,899,607               ACCEPTABLE  All other cases are acceptable.
            0, 1         4,089               ACCEPTABLE  All other cases are acceptable.
            1, 0           162   ACCEPTABLE_INTERESTING  X and Y are visible in different order
            1, 1       240,682               ACCEPTABLE  All other cases are acceptable.

This example does not mean it is possible to enumerate all "dangerous" optimizations and disable them. Modern optimizers work as complicated graph matching-and-crunching machines, and reliably disabling a particular kind of optimization usually means disabling the optimizer completely.

There are other kinds of plausible optimizations around barriers that runtimes are making, or will choose to do in the future. Even JSR 133 Cookbook has the "Removing Barriers" section that gives a short outline of what elision techniques are readily available.

Given that, how can you trust barriers, if they are routinely removable?

Barriers are implementation details, not the behavioral specification. Explaining the semantics of concurrent code using them is dangerous at best, and keeps you tidally locked with a particular runtime implementation.

2.4. Myth: Reorderings And "Commit to Memory"

The second part of the confusion is the notion of committing to memory. The pre-Java 5 memory model had the notion of "thread local caches" and "main memory". "Flushing to main memory" had some meaning there. In the new post-Java 5 model this is not the case. However, while most people moved on, they still implicitly rely on "main memory" abstraction as the basis for their mental models. That naive mental model says: when operations finally commit to the memory, memory will serialize it.

This intuition is broken on two counts. First, memory is already asynchronous and so serialization is not guaranteed. Second, most of the interesting things are happening on the micro-scale of cache coherency, not in the "actual" main memory.

Here comes the "Independent Reads of Independent Writes" (IRIW) test. It is really simple:

@JCStressTest
@State
class IRIW {
  int x;
  int y;

  @Actor
  void writer1() {
    x = 1;
  }

  @Actor
  void writer2() {
    y = 1;
  }

  @Actor
  void reader1(IntResult4 r) {
    r.r1 = x;
    r.r2 = y;
  }

  @Actor
  void reader2(IntResult4 r) {
    r.r3 = y;
    r.r4 = x;
  }
}

But this example is profound for the way you understand concurrency. First of all, consider whether the outcome (r1, r2, r3, r4) = (1, 0, 1, 0) is plausible. One may come up with the "reordering" explanation: reorder r3 = y and r4 = x, and the result is trivially achievable.

Okay then, since we know where the troubles lie, let’s add fences to inhibit those pesky reorderings:

@JCStressTest
@State
public class FencedIRIW {

    int x;
    int y;

    @Actor
    public void actor1() {
        UNSAFE.fullFence(); // (1)
        x = 1;
        UNSAFE.fullFence(); // (1)
    }

    @Actor
    public void actor2() {
        UNSAFE.fullFence(); // (1)
        y = 1;
        UNSAFE.fullFence(); // (1)
    }

    @Actor
    public void actor3(IntResult4 r) {
        UNSAFE.loadFence(); // (2)
        r.r1 = x;
        UNSAFE.loadFence(); // (2)
        r.r2 = y;
        UNSAFE.loadFence(); // (2)
    }

    @Actor
    public void actor4(IntResult4 r) {
        UNSAFE.loadFence(); // (2)
        r.r3 = y;
        UNSAFE.loadFence(); // (2)
        r.r4 = x;
        UNSAFE.loadFence(); // (2)
    }
}

1	Do full fences around stores, just in case.
2	Do load fences around loads, to nail them in place. Somewhat excessive, but safe, don’t you think?

And then run it on POWER:

      [OK] net.shipilev.jmm.FencedIRIW
    (fork: #7, iteration #1, JVM args: [-server])
  Observed state   Occurrences   Expectation  Interpretation
...
      1, 0, 1, 0            47    ACCEPTABLE  Threads see the updates in the inconsistent order
...

Dang it! But how is that possible? We use those magic fences that inhibit reorderings!

The trouble comes from the fact that some machines, notably POWER, do not guarantee multi-copy atomicity: "Either all processors see the write, or no processor does (yet)".^[4] The absence of this property precludes the existence of a total store order, even if you put fences all around the stores.

An interesting tidbit: fences are usually specified as inhibiting local reorderings, but what we need here is something that enforces the global order, e.g. quiesces the inter-processor interconnect, or does other shady stuff! On POWER, there are hwsync instructions for that. Somewhat accidentally, fullFence maps to this instruction too. Replacing loadFence with fullFence in FencedIRIW would eliminate the unwanted outcome, but that would be a purely incidental property of fullFence!

If we put volatile into the IRIW example, then the JVM would take additional steps to preserve sequential consistency,^[5] notably putting hwsyncs before the volatile reads. Therefore, this example works like a charm:

@JCStressTest
@State
public class VolatileIRIW {
    volatile int x, y;

    @Actor
    public void actor1() {
        x = 1;
    }

    @Actor
    public void actor2() {
        y = 1;
    }

    @Actor
    public void actor3(IntResult4 r) {
        r.r1 = x;
        r.r2 = y;
    }

    @Actor
    public void actor4(IntResult4 r) {
        r.r3 = y;
        r.r4 = x;
    }
}

This example says "GOOD LUCK" to all the users of fences. Unless you understand the hardware in great detail, your code may work only by accident. Deviating from language guarantees by constructing your own low-level synchronization requires much more knowledge than even the experienced guys possess.^[6] Consider yourselves warned.
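If you do not have jcstress at hand, the volatile variant can still be poked at with a crude plain-Java loop. The sketch below is only an assumption-laden stand-in for the real harness (hypothetical class and method names, one fresh thread per actor, far fewer samples), so it would be nearly useless for catching the failure in the non-volatile versions. It is shown only to make the forbidden outcome concrete: with volatile x and y, the JMM requires the count of (1, 0, 1, 0) to be zero.

```java
import java.util.concurrent.CountDownLatch;

public class VolatileIriwCheck {
    static class State {
        volatile int x, y;
        int r1, r2, r3, r4; // plain fields: written by readers, read after join()
    }

    // Wraps an actor body so that all four threads start at roughly the same time.
    static Thread actor(CountDownLatch go, Runnable body) {
        return new Thread(() -> {
            try {
                go.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            body.run();
        });
    }

    // Runs the four actors once on a fresh State.
    // Returns true if the (1, 0, 1, 0) outcome was observed.
    static boolean runOnce() {
        State s = new State();
        CountDownLatch go = new CountDownLatch(1);
        Thread[] threads = {
            actor(go, () -> s.x = 1),
            actor(go, () -> s.y = 1),
            actor(go, () -> { s.r1 = s.x; s.r2 = s.y; }),
            actor(go, () -> { s.r3 = s.y; s.r4 = s.x; }),
        };
        for (Thread t : threads) t.start();
        go.countDown();
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return s.r1 == 1 && s.r2 == 0 && s.r3 == 1 && s.r4 == 0;
    }

    public static int countForbidden(int iterations) {
        int forbidden = 0;
        for (int i = 0; i < iterations; i++) {
            if (runOnce()) forbidden++;
        }
        return forbidden;
    }

    public static void main(String[] args) {
        System.out.println("forbidden outcomes: " + countForbidden(1_000));
    }
}
```

Remember the earlier caveat: a zero here is what the specification promises for volatile fields, not something this little loop could prove on its own.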

2.5. Myth: Commit Semantics = Commit to Memory

"But hey, Aleksey, see the part of the specification that actually calls out commits!":

The almost-white noise is what people actually see in the specification text

The sad part is that reading that chapter for the first time, you can get tunnel vision, meet the familiar "commit" word, assign your own meaning to it, and disregard the rest of the chapter. That "commit" in JLS 17.4.8 is not about committing to memory, it is part of a formal validation scheme that tries to verify that there are no self-justifying action loops in the execution. This validation scheme does not generally preclude races, does not preclude observing different writes, or observing them in a non-intuitive order. It only precludes some selected bad cycles.

2.6. Myth: Synchronized and Volatiles Are Completely Different

Now, for something completely different. When I talk with people about the memory model, I am frequently surprised that many miss the beautiful symmetry between synchronized and volatile. Both induce synchronization actions. Unlock and volatile write are similar in their "release" action. Lock and volatile read are similar in their "acquire" action. This symmetry allows showing an example on volatile-s, and almost immediately arriving at a similar synchronized one.

For example, memory effect-wise, these chunks of code are equivalent:

class B<T> {
  T x;
  public void set(T v) {
    synchronized(this) {
      x = v;
    } // "release" on unlock
  }

  public T get() {
    synchronized(this) { // "acquire" on lock
      return x;
    }
  }
}

class B<T> {
  volatile T x;
  public void set(T v) {
    x = v; // "release" on volatile store
  }

  public T get() {
    return x; // "acquire" on volatile load
  }
}

This symmetry allows constructing locks:

int a, b, x;

synchronized(lock) {
  a = x;
  b = 1;
}

int a, b, x;
volatile boolean busyFlag;

// pseudo-code: compareAndSet atomically flips busyFlag from false to true
while (!compareAndSet(lock.busyFlag, false, true)); // burn for the mutual exclusion Gods
  a = x;
  b = 1;
lock.busyFlag = false;

The second block is a plausible implementation of a synchronized section (all right, without wait/notify semantics): it wastes resources spinlooping on lock acquisition, but it is otherwise correct.
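The pseudo-code above can be made runnable with java.util.concurrent.atomic.AtomicBoolean, whose compareAndSet and set have volatile read/write memory effects. This is a minimal sketch under that substitution, with hypothetical class and method names, and it is no replacement for synchronized or java.util.concurrent.locks in real code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinLockDemo {
    static final AtomicBoolean busyFlag = new AtomicBoolean(false);
    static int counter; // plain field, protected only by the spin lock

    static void lock() {
        // "acquire": spin until we are the one who flips false -> true
        while (!busyFlag.compareAndSet(false, true)) {
            Thread.onSpinWait();
        }
    }

    static void unlock() {
        // "release": volatile-strength store that publishes the critical section
        busyFlag.set(false);
    }

    // Two threads each increment the plain counter perThread times under the lock.
    public static int run(int perThread) {
        counter = 0;
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                lock();
                try {
                    counter++;
                } finally {
                    unlock();
                }
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(100_000));
    }
}
```

Because every unlock synchronizes with the next successful lock, the plain counter++ inside the critical section never loses an update, and run(n) returns 2 * n.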

3. Pitfalls: Little Deviations Are Fine

This section covers the basic mistakes users make, and shows how a little deviation from the rules can have devastating consequences.

3.1. Pitfall: Non-Synchronized Is Fine

Let’s take the volatile increment atomicity example and slightly modify it:

@JCStressTest
@State
public class VolatileCounters {
  volatile int x;

  @Actor
  void actor1() {
    for (int i = 0; i < 10; i++) {
      x++;
    }
  }

  @Actor
  void actor2() {
    for (int i = 0; i < 10; i++) {
      x++;
    }
  }

  @Arbiter
  public void arbiter(IntResult1 r) {
    r.r1 = x;
  }
}

In jcstress, @Arbiter methods execute after both @Actor-s have completed their work. This is very useful for asserting the final state, after all concurrent updates.

Intuitively, by looking at the non-looped example from the "Java Concurrency Stress Tests" section, you can imagine that a conflict between two threads may lose an update. The second intuitive stretch is that you can only lose updates, but you never step back. Notably, if you ask people what the possible outcomes for the test above are, most will answer "Somewhere between 10 and 20, inclusive".

But if you run the test, then:

      [OK] net.shipilev.jmm.VolatileCounters
    (fork: #1, iteration #1, JVM args: [-server])
  Observed state   Occurrences   Expectation  Interpretation
              10       153,217    ACCEPTABLE  $x $y
              11       273,440    ACCEPTABLE  $x $y
              12       465,262    ACCEPTABLE  $x $y
              13       611,123    ACCEPTABLE  $x $y
              14       810,790    ACCEPTABLE  $x $y
              15     1,139,737    ACCEPTABLE  $x $y
              16     1,189,164    ACCEPTABLE  $x $y
              17     1,163,565    ACCEPTABLE  $x $y
              18     1,149,772    ACCEPTABLE  $x $y
              19       986,010    ACCEPTABLE  $x $y
              20     7,449,917    ACCEPTABLE  $x $y
               6             4    ACCEPTABLE  $x $y
               7         6,442    ACCEPTABLE  $x $y
               8        23,762    ACCEPTABLE  $x $y
               9        66,175    ACCEPTABLE  $x $y

What the Hell? Broken JVM! Broken hardware! Why don’t my unsynchronized counters work?

But, consider this timeline of events:

Thread 1: (0 ----------------------> 1)     (1->2)(2->3)(...)(9->10)
Thread 2:   (0->1)(1->2)(...)(8->9)     (1 --------------------------> 2) [end result: 2]

Here, the first thread got stuck on the very first update, and after getting unstuck destroyed the result of all nine updates of the second thread. But, it would then normally catch up, right? Not if the second thread reads "1" on its last iteration, and gets stuck too, in the end reciprocally destroying the first thread’s updates. This leaves us with "2" as the end result.
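The timeline can be replayed deterministically on a single thread by splitting each x++ into its read and its write. This is only a hand-written sketch of that one schedule, with hypothetical names, not something the harness produces.

```java
public class CounterTimelineReplay {
    // Replays the timeline above step by step; returns the final counter value.
    public static int replay() {
        int x = 0;          // stands in for the shared volatile counter

        int t1 = x;         // T1, iteration 1: reads 0, then stalls
        for (int i = 0; i < 9; i++) {
            x = x + 1;      // T2, iterations 1..9 run undisturbed, x ends at 9
        }
        x = t1 + 1;         // T1 resumes and writes 1, wiping T2's nine updates

        int t2 = x;         // T2, iteration 10: reads 1, then stalls
        for (int i = 0; i < 9; i++) {
            x = x + 1;      // T1, iterations 2..10 run undisturbed, x ends at 10
        }
        x = t2 + 1;         // T2 resumes and writes 2, wiping T1's updates

        return x;
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints 2
    }
}
```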

Of course, the probability of this interference gets lower as you require longer interference windows: this is why our empirical test results cut off at "6". If we acquire more samples, we will eventually see lower values too.

This is a perfect example of how improperly synchronized code may produce oddly unintuitive results. The non-atomic counter may encounter a catastrophic interference between the threads, setting it up for losing a virtually unbounded number of updates.
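The timeline above can be replayed deterministically in a single thread, by spelling out each thread's read and write of the counter as separate steps. This is only a sketch: LostUpdateTimeline and replay() are made-up names, and the plain static field merely stands in for the shared counter from the test.

```java
final class LostUpdateTimeline {
    static int counter; // stands in for the shared counter

    static int replay() {
        counter = 0;

        int t1Stale = counter;          // Thread 1 reads 0 and gets stuck
        for (int i = 0; i < 9; i++) {   // Thread 2 completes nine increments: 0 -> 9
            counter = counter + 1;
        }
        counter = t1Stale + 1;          // Thread 1 gets unstuck and writes 1

        int t2Stale = counter;          // Thread 2 reads 1 on its last iteration and gets stuck
        for (int i = 0; i < 9; i++) {   // Thread 1 completes its remaining nine increments: 1 -> 10
            counter = counter + 1;
        }
        counter = t2Stale + 1;          // Thread 2 gets unstuck and writes 2

        return counter;
    }
}
```

Each simulated thread performs its ten increments, yet replay() returns 2.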

3.2. Pitfall: Semi-Synchronized Is Fine

The most frequent and impactful misunderstanding of the model is really heart-breaking. Surprisingly, it comes even after studying the memory model in some detail. Consider this example again:

class Box {
  int x;
  public Box(int v) {
    x = v;
  }
}

class RacyBoxy {
  Box box;

  public synchronized void set(Box v) {
    box = v;
  }

  public Box get() {
    return box;
  }
}

Way too many folks would nod through your JMM explanation, and then say this code is properly synchronized. Their reasoning goes like this: reference store is atomic, and therefore there is no need to care about anything else. What that reasoning misses is that the issue here is not about the access atomicity (e.g. whether you can see the non-complete version of the reference itself), but the ordering constraints. The actual failure comes from the fact that reading a reference to an object and reading the object’s fields are distinct under the memory model.

Therefore, you really need to ask yourself, given this test:

@JCStressTest
@State
public class SynchronizedPublish {
  RacyBoxy boxie = new RacyBoxy();

  @Actor
  void actor() {
    boxie.set(new Box(42)); // set is synchronized
  }

  @Actor
  void observer(IntResult1 r) {
    Box t = boxie.get(); // get is not synchronized
    if (t != null) {
      r.r1 = t.x;
    } else {
      r.r1 = -1;
    }
  }
}

…is the outcome 0 plausible? JMM says yes, because there are conformant executions that read a Box reference from RacyBoxy, and then read 0 from its field.

That said, running the example on x86 would almost always lead to this:

      [OK] net.shipilev.jmm.SynchronizedPublish
    (fork: #1, iteration #1, JVM args: [-server])
  Observed state   Occurrences   Expectation  Interpretation
              -1    43,265,036    ACCEPTABLE  Not ready yet
               0             0    ACCEPTABLE  Field is not visible yet
              42     1,233,714    ACCEPTABLE  Everything is visible

Hooray? Take that, Memory Model! But let’s run it on POWER:

      [OK] net.shipilev.jmm.SynchronizedPublish
    (fork: #1, iteration #1, JVM args: [-server])
  Observed state   Occurrences   Expectation  Interpretation
              -1   362,286,539    ACCEPTABLE  Not ready yet
               0          2341    ACCEPTABLE  Field is not visible yet
              42       616,150    ACCEPTABLE  Everything is visible

Oops, there is our acceptable outcome. Note that it plays against the broken mental model that says "synchronized emits a memory barrier at the end, and thus non-synchronized reads are fine". Memory consistency requires cooperation from both parties.

Writing reliable software should be based on the actual language guarantees, not on anecdotal observations of what the hardware is doing today, assuming you even know what hardware you will be running on in the future. Once you start to rely on empirical testing only, prepare to suffer.
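One way to get both parties to cooperate is to make the reader take the same lock as the writer. The sketch below assumes a hypothetical SynchronizedBoxy class that differs from RacyBoxy only in get() being synchronized too; Box is repeated to keep the snippet self-contained.

```java
final class Box {
    int x;

    Box(int v) {
        x = v;
    }
}

final class SynchronizedBoxy {
    private Box box;

    // The unlock at the end of set() is the release...
    synchronized void set(Box v) {
        box = v;
    }

    // ...and the lock at the start of get() on the same monitor is the acquire.
    synchronized Box get() {
        return box;
    }
}
```

With both methods locking the same monitor, an observer that gets a non-null Box from get() is guaranteed to see the field written in the constructor.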

3.3. Pitfall: Adding Volatiles Closer To The Problem Helps

The example above is fixable if we add some volatile-s. But the important thing is to know where to add them exactly. For instance, you may see the examples like this:

@JCStressTest
@State
public class SynchronizedPublish_VolatileMeh {
  volatile RacyBoxy boxie = new RacyBoxy(); // (1)

  @Actor
  void actor() {
    boxie.set(new Box(42));
  }

  @Actor
  void observer(IntResult1 r) {
    Box t = boxie.get();
    if (t != null) {
      r.r1 = t.x;
    } else {
      r.r1 = -1;
    }
  }
}

1	Oh yes, totally legit, let’s put volatile there

This, however, is incorrect: the model only guarantees happens-before between the actual volatile store and actual volatile load that observes that store. Making the container itself volatile would not help, because there is no volatile write to match the read to. So SynchronizedPublish_VolatileMeh fails in the same way SynchronizedPublish does. It will only help to put volatile over RacyBoxy.box field, so that a volatile store in RacyBoxy.set would match with a volatile load in RacyBoxy.get.

For the same reason, this example has surprising outcomes:

@JCStressTest
@State
class VolatileArray {
  volatile int[] arr = new int[2]; // (1)

  @Actor
  void actor() {
    int[] a = arr;
    a[0] = 1; // (2)
    a[1] = 1;
  }

  @Actor
  void observer(IntResult2 r) {
    int[] a = arr;
    r.r1 = a[1];
    r.r2 = a[0];
  }
}

1	`volatile` array, better be good!
2	`volatile` store?

Even though the array itself is volatile, the reads and writes to the array elements do not have the volatile semantics. Therefore, the outcome 1, 0 is plausible.

It can be clearly demonstrated on POWER:

      [OK] net.shipilev.jmm.VolatileArray
    (fork: #1, iteration #1, JVM args: [-server])
  Observed state   Occurrences              Expectation  Interpretation
            0, 0       704,015               ACCEPTABLE  Everything else is acceptable too.
            0, 1         1,291               ACCEPTABLE  Everything else is acceptable too.
            1, 0           118   ACCEPTABLE_INTERESTING  Ordering? You wish.
            1, 1    37,136,486               ACCEPTABLE  Everything else is acceptable too.

Advanced APIs, like Atomic{Integer,Long,Reference}Array provide the volatile semantics, if needed. Since JDK 9, VarHandles provide assorted efficient memory access modes too.
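For illustration, here is how the VolatileArray example might look with AtomicIntegerArray, whose get and set have volatile semantics per element. VolatileElements is a hypothetical name, and the actors are plain methods rather than jcstress @Actor-s to keep the sketch self-contained.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

final class VolatileElements {
    final AtomicIntegerArray arr = new AtomicIntegerArray(2);

    void actor() {
        arr.set(0, 1); // volatile store to element 0
        arr.set(1, 1); // volatile store to element 1
    }

    int[] observer() {
        int r1 = arr.get(1); // volatile load of element 1
        int r2 = arr.get(0); // volatile load of element 0
        return new int[] { r1, r2 };
    }
}
```

In this version, an observer that reads 1 from element 1 must also read 1 from element 0, so the 1, 0 outcome is forbidden.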

3.4. Pitfall: Releasing in Wrong Order

It is amazing to see how many folks glance over the synthetic examples, and do not map them to their day-to-day code. Take this example:

@JCStressTest
@State
public class ReleaseOrderWrong {
    int x;
    volatile int g;

    @Actor
    public void actor1() {
        g = 1;
        x = 1;
    }

    @Actor
    public void actor2(IntResult2 r) {
        r.r1 = g;
        r.r2 = x;
    }
}

Here, the outcome 1, 0 can be observed empirically, which seems to look like a happens-before violation. What gives: didn’t we observe the volatile read g = 1, why don’t we observe x = 1? The answer is that the release order is wrong: we have to do "some-writes" → volatile write → volatile read → "some-reads", in order to guarantee that "some-reads" see the "some-writes". In this example, g = 1 and x = 1 come in the wrong order. The outcome in question can even be explained by a sequentially consistent execution!

Many would laugh this example away, and then proceed doing something like this:

public class MyListMyListMyListIsOnFire {

  private volatile List<Integer> list;

  void prepareList() {
     list = new ArrayList<>();
     list.add(1);
     list.add(2);
  }

  List<Integer> getMyList() {
     return list;
  }
}

Here, the volatile write of list is coming first, then the updates. If you think that guarantees that callers of getMyList see the entire list contents, rewind to a motivating example above, and think again!

This failure is easy to demonstrate on x86:

@JCStressTest
@State
public class ReleaseOrder2Wrong {

    volatile List<Integer> list;

    @Actor
    public void actor1() {
        list = new ArrayList<>();
        list.add(42);
    }

    @Actor
    public void actor2(IntResult1 r) {
        List<Integer> l = list;
        if (l != null) {
            if (l.isEmpty()) {
                r.r1 = 0;
            } else {
                r.r1 = l.get(0);
            }
        } else {
            r.r1 = -1;
        }
    }
}

…yields:

 [OK] net.shipilev.jmm.ReleaseOrder2Wrong
      (fork: #1, iteration #1, JVM args: [-server])
  Observed state   Occurrences              Expectation  Interpretation
              -1    65,119,848               ACCEPTABLE  Reading null list
               0       252,169   ACCEPTABLE_INTERESTING  List is not fully populated
              42     1,980,313               ACCEPTABLE  Reading a fully populated list

Oops.
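A minimal sketch of the fix, assuming the list is never modified after it is published: fill the list through a local variable first, and do the volatile store last. ReleaseOrderRight is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class ReleaseOrderRight {
    private volatile List<Integer> list;

    void prepareList() {
        List<Integer> l = new ArrayList<>(); // build privately, nobody else can see it yet
        l.add(1);
        l.add(2);
        list = l;                            // the volatile store comes last
    }

    List<Integer> getMyList() {
        return list;                         // volatile load
    }
}
```

A caller that reads a non-null list now sees both elements. Any add() done after the volatile store would again fall outside of that guarantee.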

3.5. Pitfall: Acquiring in Wrong Order

The symmetric case is when you observe in a different order:

@JCStressTest
@State
public class AcquireOrderWrong {
    int x;
    volatile int g;

    @Actor
    public void actor1() {
        x = 1;
        g = 1;
    }

    @Actor
    public void actor2(IntResult2 r) {
        r.r1 = x;
        r.r2 = g;
    }
}

  [OK] net.shipilev.jmm.AcquireOrderWrong
 (fork: #1, iteration #1, JVM args: [-server])
  Observed state   Occurrences   Expectation  Interpretation
            0, 0    60,839,389    ACCEPTABLE  All other cases are acceptable.
            0, 1           579    ACCEPTABLE  All other cases are acceptable.
            1, 0        41,053    ACCEPTABLE  All other cases are acceptable.
            1, 1    40,122,239    ACCEPTABLE  All other cases are acceptable.

In this case, the outcome 0, 1 is completely plausible, and can also be explained by a sequentially consistent execution: r.r1 = x (reads 0), x = 1, g = 1, r.r2 = g (reads 1). Again, it is easy to laugh this example off, but then discover an actual bug hiding in a little more complicated production code. Be vigilant!

3.6. Pitfall: Acquiring and Releasing in Wrong Order

Combining the two previous pitfalls into one major pitfall, we can get both the acquire and the release wrong:

@JCStressTest
@State
public class AcquireReleaseOrderWrong {
    int x;
    volatile int g;

    @Actor
    public void actor1() {
        g = 1;
        x = 1;
    }

    @Actor
    public void actor2(IntResult2 r) {
        r.r1 = x;
        r.r2 = g;
    }
}

…yields an interesting case too. Here, the outcome 1, 0 cannot be explained by any sequentially consistent execution! But of course, it is a naked data race, and here is our "bad" result:

  [OK] net.shipilev.jmm.AcquireReleaseOrderWrong
 (fork: #1, iteration #1, JVM args: [-server])
Observed state   Occurrences     Expectation  Interpretation
          0, 0   108,771,152      ACCEPTABLE  All other cases are acceptable.
          0, 1     1,137,881      ACCEPTABLE  All other cases are acceptable.
          1, 0        15,218      ACCEPTABLE  All other cases are acceptable.
          1, 1    29,451,719      ACCEPTABLE  All other cases are acceptable.

3.7. Avoiding Pitfalls

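The most important thing that you would ever learn from the Java Memory Model is the notion of safe publication. For volatiles, it goes like this: the publisher does "some-writes" and then a volatile write; the consumer does a volatile read that observes that write, and then does "some-reads". Only then are "some-reads" guaranteed to see "some-writes".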

Note the caveats: deviating just a little bit from them obliterates the guarantees, as examples in this section show.^[7] ^[8] This is the golden example:

@JCStressTest
@Outcome(id = "1, 0", expect = Expect.FORBIDDEN,  desc = "Happens-before violation")
@Outcome(             expect = Expect.ACCEPTABLE, desc = "All other cases are acceptable.")
@State
public class SafePublication {

    int x;
    volatile int ready;

    @Actor
    public void actor1() {
        x = 1;
        ready = 1;
    }

    @Actor
    public void actor2(IntResult2 r) {
        r.r1 = ready;
        r.r2 = x;
    }
}

The naming choices for special fields can help to quickly diagnose problems in the source code. See how calling a volatile field above ready clearly highlights the correct order of operations.

…and it indeed does not show the "bad" outcome on any platform. Here is POWER: as weak as its hardware memory model is, it is no match for the language guarantees:

      [OK] net.shipilev.jmm.SafePublication
    (fork: #1, iteration #1, JVM args: [-server])
  Observed state   Occurrences   Expectation  Interpretation
            0, 0    69,358,115    ACCEPTABLE  All other cases are acceptable.
            0, 1     2,402,453    ACCEPTABLE  All other cases are acceptable.
            1, 0             0     FORBIDDEN  Happens-before violation
            1, 1    44,989,512    ACCEPTABLE  All other cases are acceptable.

Using the symmetry with synchronized, we can construct similar cases with Java locks.
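As a rough sketch of such a case, the volatile write and read from SafePublication can be replaced with an unlock and a lock on the same monitor. LockedPublication and its members are hypothetical names, and the actors are plain methods here.

```java
final class LockedPublication {
    private final Object lock = new Object();
    private int x;
    private boolean ready;

    void publish() {
        x = 1;                  // "some-writes"
        synchronized (lock) {
            ready = true;
        }                       // the unlock is the release
    }

    int[] observe() {
        boolean r;
        synchronized (lock) {   // the lock on the same monitor is the acquire
            r = ready;
        }
        int r2 = x;             // "some-reads"
        return new int[] { r ? 1 : 0, r2 };
    }
}
```

If observe() sees ready set, then the publisher's unlock happens-before the observer's lock, and the read of x must return 1. The 1, 0 outcome is forbidden, just like in the volatile version.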

4. Wishful Thinking: Hold My Beer While I Am…

This section describes the usual abuses by the more advanced JMM users, and explains why they do not work.

4.1. Wishful Thinking: My Code Has All Happens-Befores!

First, a little caveat about the phraseology. You can frequently see people handwave with "happens-before" like there is no tomorrow. Notably, faced with this code:

int x;
volatile int g;

void m() {
  x = 1;
  g = 1;
}

void r() {
  int lg = g;
  int lx = x;
}

…they say that g = 1 happens-before int lg = g. This train-wrecks the reasoning further by logically arriving at the conclusion that int lx = x would always see x = 1 (since x = 1 hb g = 1, and int lg = g hb int lx = x too). This is a very easy mistake to make, and you have to keep in mind that happens-before (and other orders in JMM formalism) applies to actions, not to statements.

Having slightly different designations for actions helps to distinguish actions from statements. I like to use this nomenclature: write(x, V) writes value V to variable x; and read(x):V reads value V from variable x. In that nomenclature, you can say write(g, 1) happens-before read(g):1, because now you describe the actual action, not just some abstract program statement.

For completeness, these are the valid executions under JMM:

write(x, 1) →_hb write(g, 1) … read(g):0 →_hb read(x):0
write(x, 1) →_hb write(g, 1) … read(g):0 →_hb read(x):1
write(x, 1) →_hb write(g, 1) →_hb read(g):1 →_hb read(x):1

This execution breaks HB consistency, because read(x) should be observing the latest write to x, but it does not: write(x, 1) →_hb write(g, 1) →_hb read(g):1 →_hb read(x):0

Note that a good API spec is careful to speak about actions, and their connection with the actual observable events. E.g. the java.util.concurrent package spec says:

The methods of all classes in java.util.concurrent and its subpackages extend these guarantees to higher-level synchronization. In particular: Actions taken by the asynchronous computation represented by a Future happen-before actions subsequent to the retrieval of the result via Future.get() in another thread.
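To make the quoted guarantee a bit more tangible, here is a small sketch that hands a plain field over through a Future. FutureHandoff, data and computeAndRead() are hypothetical names; the only ordering relied upon is the documented one between the task's actions and the actions after Future.get().

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FutureHandoff {
    static int data; // plain field: not volatile, not guarded by a lock

    static int computeAndRead() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> f = pool.submit(() -> { data = 42; }); // write(data, 42) in the worker thread
            f.get();      // retrieval of the result
            return data;  // this read is subsequent to get(), so it observes the write
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that the guarantee is about actions: the write done by the task happens-before the read that follows a completed get(). It says nothing about reads done without calling get().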

4.2. Wishful Thinking: Happens-Before Is The Actual Ordering

Now, the notion of orders somehow wrecks people's minds into assuming that "order" from the set theory used in the JMM spec somehow relates to the physical execution order. Notably, people claim that if two actions are in program order, that’s the order they are executing in (which precludes any optimization!); or if they are tied in happens-before, then they also execute in that happens-before order (which precludes any optimization too, given HB is an extension of PO).

It goes into ridiculous examples, like this:

@JCStressTest
@State
public class ReadAfterReadTest {
  int a;

  @Actor
  void actor1() {
    a = 1;
  }

  @Actor
  void actor2(IntResult2 r) {
    r.r1 = a;
    r.r2 = a;
  }
}

The actual test is more complicated due to the need to dodge compiler optimizations, but this example would also do. Just imagine that compiler does not coalesce the common reads of a.

Is the 1, 0 result plausible? E.g. if we read 1 already, can we read 0 next time? JMM says we can, because the execution producing this outcome does not violate the memory model requirements. Informally, we can say that a decision on what a particular read can observe is made for each read in isolation. Since both reads are racy, both may return either 0 or 1.

This can be demonstrated even on otherwise strong hardware, like x86:

    [OK] o.o.j.t.volatiles.ReadAfterReadTest
    (fork: #1, iteration #1, JVM args: [-server])
Observed state   Occurrences              Expectation  Interpretation
          0, 0    16,736,450               ACCEPTABLE  Doing both reads early.
          0, 1         3,941               ACCEPTABLE  Doing first read early, not surprising.
          1, 0        84,477   ACCEPTABLE_INTERESTING  First read seen racy value early, and the s...
          1, 1   108,816,262               ACCEPTABLE  Doing both reads late.

Happens-before order is only useful for the happens-before consistency rule, which describes what writes a particular read can observe. It neither mandates nor enforces any particular physical execution order of actions.

This goes against intuition, and indeed, many would argue that you cannot see 1 and then see 0 in two back-to-back reads from the same location. That argument hinges on a definition of "then", and for most people it includes the intuitive/naive model of time, that gets contaminated with the program order. But in the Java Memory Model, "then" is defined by a partial happens-before order (for some paired write-read actions) and its consistency rules, and not by the program order itself. Therefore, it is useless to think about two independent reads to happen in any particular order here.^[9]

Marking field a with the volatile modifier precludes the 1, 0 outcome, because then the synchronization order consistency rule governs both reads. From that it follows that if the first read sees the a = 1 write, the second read should too.

4.3. Wishful Thinking: Synchronization Order Is The Actual Ordering

Next up, a more metaphysical question. Synchronization Order (SO) is specified to be a total order over actions. Getting back to the IRIW-like example:

@JCStressTest
@State
class IRIW {
  volatile int x, y;

  @Actor
  void writer1() {
    x = 1;
  }

  @Actor
  void writer2() {
    y = 1;
  }

  @Actor
  void reader1(IntResult4 r) {
    r.r1 = x;
    r.r2 = y;
  }

  @Actor
  void reader2(IntResult4 r) {
    r.r3 = y;
    r.r4 = x;
  }
}

We know that the (1, 0, 1, 0) outcome is precluded by SO rules. But if we take two "CPU demons" and let them observe the machine state of CPUs running readers, would it be possible for these CPU demons to have the inconsistent worldviews?

The answer is: yes, of course, we can see that both CPUs have mixed understanding which event took place first: x = 1 or y = 1. Therefore, even though specification requires that actions are in total order, the actual physical execution order may differ, depending on your observation point.^[10]

Having said that, the question we should be asking ourselves is not "What does the machine see?", but "What does the machine allow programs to see?". Some machines hide these details (and arguably pay with performance), some do not. In the latter case, runtime cooperation is required to avoid observing unwanted state. In the end, the runtime would only allow programs to see results that are consistent with some total order of events, even though it was physically chaos.

Do you want to talk to chaotic hardware? Nope? Use the language memory models then, and let others take care of this.

4.4. Wishful Thinking: Unobserved Synchronized Has Memory Effects

If you haven’t internalized that already, Java Memory Model only guarantees ordering across matching releases and acquires. This means that unobserved releases/acquires have no meaning under memory model, and are not required to produce memory effects. But quite often, you will see code like this:

void synchronize() {
  synchronized(something) {}; // derp... memory barrier?
}

We already know that barriers are implementation details, and may be omitted by the runtime. In fact, it is easy to demonstrate this with the following test:

@JCStressTest
@State
public class SynchronizedAreNotBarriers {
  int x, y;

  @Actor
  public void actor1() {
    x = 1;
    synchronized (new Object()) {} // o-la-la, a barrier.
    y = 1;
  }

  @Actor
  public void actor2(IntResult2 r) {
    r.r1 = y;
    synchronized (new Object()) {} // o-la-la, a barrier.
    r.r2 = x;
  }
}

Is the 1, 0 outcome plausible? Naively, with synchronized-as-barrier model it is forbidden. But it is easy for the runtime to exploit the simple fact that synchronizing on new Object() has no effect, and purge such blocks out. The resulting optimized code would not have any barriers. And so even on x86 this would happen:

[OK] net.shipilev.jmm.SynchronizedAreNotBarriers
 (fork: #1, iteration #1, JVM args: [-server])
  Observed state   Occurrences   Expectation  Interpretation
            0, 0     2,705,391    ACCEPTABLE  All other cases are acceptable.
            0, 1        40,709    ACCEPTABLE  All other cases are acceptable.
            1, 0        13,356    ACCEPTABLE  Racy read of x
            1, 1    61,341,794    ACCEPTABLE  All other cases are acceptable.

Unobserved synchronized blocks are not barriers. In fact, you cannot even rely on barrier understanding, because runtimes would exploit the optimization opportunities without consulting you. The only way to build reliable software is adhering to language rules, not shady techniques you barely understand.

4.5. Wishful Thinking: Unobserved Volatiles Have Memory Effects

A similar example concerns volatile-s. It is tempting to read the JSR 133 Cookbook again, and imagine that since volatile implementations mean barriers, we can use volatiles to get barrier semantics! This should totally work, right?

@JCStressTest
@State
public class VolatilesAreNotBarriers {

    static class Holder {
        volatile int GREAT_BARRIER_REEF;
    }

    int x, y;

    @Actor
    public void actor1() {
        Holder h = new Holder();
        x = 1;
        h.GREAT_BARRIER_REEF = h.GREAT_BARRIER_REEF;
        y = 1;
    }

    @Actor
    public void actor2(IntResult2 r) {
        Holder h = new Holder();
        r.r1 = y;
        h.GREAT_BARRIER_REEF = h.GREAT_BARRIER_REEF;
        r.r2 = x;
    }
}

If you run it on modern HotSpot, then you will see that -server (C2) compiler still leaves barriers behind. But that seems to be an implementation inefficiency: while it purges both Holder instances and volatile ops, it loses the association between the actual store and the relevant barrier shortly after parsing.

My duct-taped Graal runs show that Graal seems to eliminate both instances and associated barriers on x86: disassembly shows no barriers, and performance is 10x faster in actor methods — but, alas, we cannot run it on POWER yet. But we certainly would not like to throw compilers under the bus and say this is forbidden.

This actually means that volatiles and fences are not easily interchangeable. Volatiles are weaker than fences when it comes to reorderings. Fences are weaker than volatiles when you need to gain high-level properties like sequential consistency, unless you put fullFence-s everywhere.

4.6. Wishful Thinking: Mismatched Ops Are Okay

From the example above it can also be understood that releasing on one variable, and acquiring on another is not guaranteed to bring memory effects:

@JCStressTest
@State
public class YourVolatilesWillCallMyVolatiles {
    int x, y;
    volatile int see;
    static volatile int BARRIER;

    @Actor
    void thread1() {
        x = 1;
        see = 1; // release on $see
        y = 1;
    }

    @Actor
    void thread2(IntResult3 r) {
        r.r1 = y;
        r.r3 = BARRIER; // acquire on $BARRIER
        r.r2 = x;
    }
}

The outcome 1, 0 (for r1 and r2) is allowed here. BARRIER is not really a barrier! Some VM implementations would still emit the same-looking barriers on both accesses, and thus accidentally provide the desired semantics, but this is not a language guarantee.

If you see code like this in the JDK Class Library, it does not mean you can use this approach without remorse. The Class Library sometimes makes assumptions about the exact JVM it is running on to bring the memory consistency guarantees to JDK users. It is very fragile to use the same code in 3rd party libraries.

4.7. Wishful Thinking: `final`-s Are Replaceable With `volatile`-s

Let’s put a final nail into the volatile-as-barrier coffin. Imagine you have a class with a volatile field initialized in constructor. Suppose you publish the instance of this class via a race. Are you guaranteed to see the set value? E.g.:

@JCStressTest
@State
public class VolatileMeansEverythingIsFine {
  static class C {
    volatile int x;
    C() { x = 42; }
  }

  C c;

  @Actor
  void thread1() {
    c = new C();
  }

  @Actor
  void thread2(IntResult1 r) {
    C c = this.c;
    r.r1 = (c == null) ? -1 : c.x;
  }
}

…is outcome 0 plausible here? "But volatile-s are so strong!" — one could say. "But barriers there are so strong!" — somebody else could say. "Who cares about barriers, when causality prevents this!" — others would say.

The fact of the matter is that JMM allows seeing 0 in this example! Only final would preclude 0, and we cannot have the field both final and volatile. This is one of the peculiar properties of the model, which can be solved by forcing all initializations to perform as final ones, probably without a prohibitive performance cost.^[11]^[12]

Notice this is yet another example of how evil the data races are. The volatile modifier on the field itself does nothing to prevent the race, because it is in the wrong place: if it were releasing/acquiring the instance itself, everything would work fine.
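A sketch of what "releasing/acquiring the instance itself" could look like: the volatile modifier moves from the field inside C to the reference that publishes it. VolatileReferencePublication is a hypothetical name, and the actors are plain methods.

```java
final class VolatileReferencePublication {
    static final class C {
        int x;              // a plain field is enough now
        C() { x = 42; }
    }

    private volatile C c;   // release/acquire happens on the reference itself

    void publish() {
        c = new C();        // constructor writes come before the volatile store
    }

    int observe() {
        C local = c;        // single volatile load
        return (local == null) ? -1 : local.x;
    }
}
```

An observer that reads a non-null reference has observed the volatile store, and so must see x = 42. The only remaining outcomes are -1 and 42.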

4.8. Wishful Thinking: TSO Machines Protect Us

Even when optimizers are not in the picture, the minute details in code generation may affect outcomes. The wishful thinking of many is that machines with Total Store Order (TSO) are saving us from bad things. This almost always hinges on the belief that if you cannot imagine why the compiler would mess up your code, then it wouldn’t. This gives rise to the urban legend that "safe initialization" is not required for x86, because it will preserve the order of field initializations and publication.

Let’s construct a simple case where an object with four fields is initialized and published via a race:

@JCStressTest
@State
public class UnsafePublication {
    int x = 1;

    MyObject o; // non-volatile, race

    @Actor
    public void publish() {
        o = new MyObject(x);
    }

    @Actor
    public void consume(IntResult1 res) {
        MyObject lo = o;
        if (lo != null) {
            res.r1 = lo.x00 + lo.x01 + lo.x02 + lo.x03;
        } else {
            res.r1 = -1;
        }
    }

    static class MyObject {
        int x00, x01, x02, x03;
        public MyObject(int x) {
            x00 = x;
            x01 = x;
            x02 = x;
            x03 = x;
        }
    }
}

Even on x86 it will yield interesting behaviors, as if we don’t see all the stores from the constructor:

      [OK] net.shipilev.jmm.UnsafePublication
    (fork: #1, iteration #1, JVM args: [-server])
  Observed state   Occurrences   Expectation  Interpretation
              -1    86,515,664    ACCEPTABLE  The object is not yet published
               0           751    ACCEPTABLE  The object is published, but all fields are 0.
               1           297    ACCEPTABLE  The object is published, at least 1 field is visible.
               2           211    ACCEPTABLE  The object is published, at least 2 fields are visible.
               3           953    ACCEPTABLE  The object is published, at least 3 fields are visible.
               4     4,057,524    ACCEPTABLE  The object is published, all fields are visible.

These kinds of failures are not theoretical! In practice, either safe publication or safe initialization (e.g. making all fields final) will prohibit these intermediate outcomes. See more at "Safe Construction and Safe Initialization".

Safe initialization is a very useful pattern that can guard against racy publication. Do not ignore or deviate from it in your code! It may save days of debugging time for you.
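A minimal sketch of the safely initialized variant of the example from this section, assuming the constructor does not leak this. SafelyInitialized is a hypothetical name; the publication itself is deliberately left racy.

```java
final class SafelyInitialized {
    static final class MyObject {
        final int x00, x01, x02, x03; // all fields are final

        MyObject(int x) {
            x00 = x;
            x01 = x;
            x02 = x;
            x03 = x;
        }
    }

    MyObject o; // non-volatile, still a race

    void publish() {
        o = new MyObject(1);
    }

    int consume() {
        MyObject lo = o; // read the racy reference once
        return (lo != null) ? lo.x00 + lo.x01 + lo.x02 + lo.x03 : -1;
    }
}
```

Here the consumer can only get -1 or 4: once it sees the object, it sees all the final fields as the constructor left them.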

Another horrifying wishful thinking comes in a deceptively simple form: you may mistakenly believe it is easy to sanitize the racy result. Taking the example from this section, no matter what you do in consume(), it would not save you from the unfolding race. If your publisher does not cooperate with the consumer, all bets are off.

final-s may protect publishers from non-cooperating racy consumers, but not vice-versa.

4.9. Wishful Thinking: Benign Races are Resilient

There are special forms of benign races (oxymoron, if you ask me). Their benignity comes from the safe initialization rules. They usually take this form:

@JCStressTest
@State
public class BenignRace {

  @Actor
  public void actor1(IntResult2 r) {
    MyObject m = get();
    if (m != null) {
      r.r1 = m.x;
    } else {
      r.r1 = -1;
    }
  }

  @Actor
  public void actor2(IntResult2 r) {
    MyObject m = get();
    if (m != null) {
      r.r2 = m.x;
    } else {
      r.r2 = -1;
    }
  }

  MyObject instance;

  MyObject get() {
    MyObject t = instance;  // read once
    if (t == null) {        // not yet there...
      t = new MyObject(42);
      instance = t;         // try to install new version
    }
    return t;
  }

  static class MyObject {
    final int x;            // safely initialized
    MyObject(int x) {
        this.x = x;
    }
  }
}

This only works if the class is safely initialized (i.e. has only final fields), and the instance field is read only once. Both conditions are critical for the race to be benign in that example. If either condition is relaxed, the race suddenly and abruptly stops being benign.

For example, relaxing the safe initialization rule opens up the failure described in Pitfall: Semi-Synchronized Is Fine:

@JCStressTest
@State
public class NonBenignRace1 {
  ...
  static class MyObject {
    int x;              // WARNING: non-final
    MyObject(int x) {
      this.x = x;
    }
  }
}

Relaxing the single read rule also breaks benignity, by setting up the failure described in Wishful Thinking: Happens-Before Is The Actual Ordering:

@JCStressTest
@State
public class NonBenignRace2 {
  ...
  MyObject instance;

  MyObject get() {
      if (instance == null) {
          instance = new MyObject(42);
      }
      return instance; // WARNING: Second read
  }
  ...
}

This may appear counter-intuitive: if we read null from instance, we take a corrective action with storing new instance, and that’s it. Indeed, if we have the intervening store to instance, we cannot see the default value, and we can only see that store (since it happens-before us in all conforming executions where first read returned null), or the store from the other thread (which is not null, just a different object). But the interesting behavior unfolds when we don’t read null from the instance on the first read. No intervening store takes place. The second read tries to read again, and being racy as it is, may read null. Ouch.

This is actually exploitable by Evil Optimizers:

T get() {
  if (a == null) {
    a = compute();
  }
  return a;
}

Introduce temporary variables (e.g. do SSA transform, and then some):

T get() {
  T t1 = a;
  if (t1 == null) {
    T t3 = compute();
    a = t3;
    return t3;
  }
  T t2 = a;
  return t2;
}

There are no ordering constraints on independent reads without intervening writes, so we can mess around with the read ordering. First, observe that once control goes into the if branch, we never reach T t2 = a, so the intervening write to a is invisible to T t2 = a, and we can move that read before the branch. Second, shuffle around the independent reads T t1 and T t2, which is allowed because the reads are independent:

T get() {
  T t2 = a;
  T t1 = a;
  if (t1 == null) {
    T t3 = compute();
    a = t3;
    return t3;
  }
  return t2;
}

This trivially gets you null at return t2.

Exploiting benign races is seldom profitable in usual code. In library code, it sometimes improves performance significantly, especially on non-TSO hardware where volatile reads are not cheap. But it takes utmost discipline to make the race really benign. When using benign races, it is your job to prove why the race is actually benign.
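A typical library-style use of a benign race is a lazily cached hash code, similar in spirit to what java.lang.String does. The sketch below is an illustration under assumptions, not the JDK code: CachedHash, cached and compute() are hypothetical names. It follows both rules from this section: the state it depends on is final, and the racy field is read exactly once.

```java
final class CachedHash {
    private final String value; // final: safely initialized
    private int cached;         // racy cache, 0 means "not computed yet"

    CachedHash(String value) {
        this.value = value;
    }

    int hash() {
        int h = cached;          // the single racy read
        if (h == 0) {
            h = compute(value);  // every thread computes the same value
            cached = h;          // racy store; losing this race is harmless
        }
        return h;                // return the local, never re-read the field
    }

    private static int compute(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }
}
```

The proof obligation here is short: int accesses are atomic, a stale 0 only causes a recomputation of the same value, and the returned value always comes from the local variable.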

5. Horror Circus: Kinda Works But Horrifying

This section is provided as the comic relief. The examples here are not a call to action. The examples here work in the same way your father-in-law is juggling chainsaws and still has two hands (this particular minute).

5.1. Horror Circus: Synchronizing on Primitives

In Java, every object is potentially a lock. This includes wrappers for primitives, which makes it possible to synchronize on primitives (their boxed wrappers, actually). Alas, as we have seen before, synchronizing on new Object() does not work, and we need to make sure that primitive values map to the same wrapper objects. Luckily, the Java specification gives another concession to us, and advertises that some small values are autoboxed to the same wrapper objects. These two (arguably overlooked) parts of the specification give rise to this contraption:

class HorribleSemaphore {
  final int limit;
  public HorribleSemaphore(int limit) {
    if (limit <= 0 || limit > 128) { // nextInt(limit) below requires a positive bound
      throw new IllegalArgumentException("Incorrect: " + limit);
    }
    this.limit = limit;
  }
  void guardedExecute(Runnable r) {
    synchronized (Integer.valueOf(ThreadLocalRandom.current().nextInt(limit))) {
      // no more than $limit threads here...
      r.run();
    }
  }
}

It does indeed allow no more than $limit threads to execute in guardedExecute at any given time. It is even faster than java.util.concurrent.Semaphore, but it comes with some serious caveats:

No guarantees on the lower bound of the executing threads: it is a little better than a single lock at managing contention, but that is it.
The lock objects are static, which means that acquiring the lock on Integer wrapper outside the HorribleSemaphore code robs its permits. Or maybe that’s a feature? Think about it: you can change the limit without involving the HorribleSemaphore instance! Talk about open for extension! SOLID design FTW.
The implementation-imposed limit is 128 threads. Well, we can add +128 to the limit to claim negative values too, which gives us a whopping 256 threads for an upper bound! Also, the JVM is future-proof enough to give us a flag that controls the autoboxing cache size, AutoBoxCacheMax, so we can always tune it up!
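
The permit-robbing caveat can be sketched in a few lines. The class and method names below are hypothetical. The only things relied upon are that Integer.valueOf caches the values from -128 to 127, and Thread.holdsLock. The monitor is taken through an Object-typed variable purely to keep the sketch short.

```java
// Hypothetical demo class, for illustration only.
final class BoxedLockIdentity {
  // True when two boxings of the same value yield the same object,
  // which Integer.valueOf guarantees for values in -128..127.
  static boolean sameWrapper(int v) {
    return Integer.valueOf(v) == Integer.valueOf(v);
  }

  // Code that has never heard of HorribleSemaphore takes the very monitor
  // that HorribleSemaphore would use when it draws the value 0.
  static boolean outsiderHoldsPermit() {
    Object lock = Integer.valueOf(0);
    synchronized (lock) {
      // A fresh boxing of 0 is the same cached object, so we hold its monitor.
      return Thread.holdsLock(Integer.valueOf(0));
    }
  }
}
```

While some outsider sits inside that synchronized block, every guardedExecute call that happens to draw 0 waits for it.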

5.2. Horror Circus: Synchronizing on Strings

Hi, Shady Mess here! Are you tired of those unnamed locks hanging around your program? Are you dying with multiple lock managers that pollute your precious bodily fluids? Now you can use Strings as locks! This is a limited offer only. The StringTable cannot hold any longer he comes he comes do not fight he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, unclaimed locks lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liquid pain, the song of deadlocks will extinguish the voices of mortal man from the sphere I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful the final snuffing of the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL IS LOST the pon̷y he comes he c̶̮omes he comes the ichor permeates all MY FACE MY FACE ᵒh god no NO NOO̼OO NΘ stop the an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

Why have:

private final Object LOCK = new Object();

synchronized(LOCK) {
	...
}

…when you can have:

synchronized("Lock") {
	...
}

Of course, it comes with the caveat that the String instance you get from the String literal is shared within the execution. But, this is a blessing in disguise — it is shared between class loaders, too! Which means you can synchronize the code without figuring out how to pass static final-s between class loaders! Woo-hoo!

Also, nobody prevents us from carefully namespacing the locks:

synchronized("This is my lock. " +
             "You cannot have it. " +
             "Get your own String to synchronize on. " +
             "There are plenty of Strings for everyone.") {
	...
}

…which also lends itself to programmatic access:

public void doWith(String whosLockThatIs, Runnable r) {
  synchronized(("This is " + whosLockThatIs + " lock. " +
               "You cannot have it. " +
               "Get your own String to synchronize on. " +
               "There are plenty of Strings for everyone.").intern()) {
    r.run();
  }
}

doWith("mine", () -> System.out.println("Peekaboo"));

What a magical language that gives me these powers!
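
A minimal sketch of why the programmatic version needs intern(): a String concatenated at run time is a fresh object, so without interning every call would lock on its own private String and exclude nobody. The class and method names are hypothetical.

```java
// Hypothetical demo class, for illustration only.
final class StringLockIdentity {
  // Builds the lock name at run time; each call creates a new String object.
  static String lockName(String owner) {
    return "This is " + owner + " lock.";
  }

  // Two separately built names are equal, but they are not the same object.
  static boolean sameObjectWithoutIntern(String owner) {
    return lockName(owner) == lockName(owner);
  }

  // intern() maps equal Strings to one canonical instance: one shared lock.
  static boolean sameObjectWithIntern(String owner) {
    return lockName(owner).intern() == lockName(owner).intern();
  }

  // A String literal is already the interned instance.
  static boolean literalIsInterned() {
    return "Lock" == new String("Lock").intern();
  }
}
```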

Conclusion and Parting Thoughts

It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.

— Mark Twain

Formal JMM rules are complex to understand. Not everyone is equipped with time and capacity to figure out the corner cases, and so everyone inevitably uses a repertoire of working constructions. Alas, that repertoire often comes from urban legends, anecdotal observations, or blind guessing. We should stop that habit. Build the repertoire based on the actual language guarantees!
To deal with 99.99% of all memory ordering problems, you have to know safe publication (the ordering rules of proper synchronization), and safe initialization (the safety rules protecting from inadvertent races). Both are explored in greater detail in "Safe Publication and Safe Initialization in Java". An additional 0.00999% is understanding how benign races work. At this level, memory model rules are actually very simple and intuitive.
The problems start when people try to dive deeper. Anything beyond the language/library guarantees exposes you to the intricate details of the inner workings of runtimes, compilers and hardware, which hardly anybody understands completely. Attempts to simplify it to "easy" interpretations like roach motel, barriers, etc. sometimes give incorrect results, as multiple examples above show. Forget how hardware works, forget how optimizers work! This understanding is fine for educational and entertainment purposes, but it is actively damaging to correctness proofs. Mostly because you will eventually equate "I cannot build an example of how it may fail" to "This never fails", and that hubris will bite you at the most unexpected moment.
Some APIs provide you with bells and whistles that may sound exciting: lazySet/putOrdered, fences, acquire/release operations in VarHandles. Their existence does not mean you have to use them! Those are left for the power users who can actually reason about them. If you want to bend the rules "just a little bit", you should prepare yourself for jumping off a very large cliff, because that’s what you are doing.

All of the above was a buildup for…

1. Gleb Smirnov (busting his cover): "Hello, yes, this is the Bayesian Conspiracy speaking. Absence of evidence is weak evidence of absence."

2. These optimizations are easy in a few selected cases, for example calling m() within the constructor of its own class.

3. This sounds very similar to C/C++11 std::atomic(…, memory_order_relaxed), which is what it is modeled after. Very convenient for hardware concurrency testing, as in this example.

4. This, and many other interesting examples, is covered by the work of Peter Sewell, Susmit Sarkar, and others. I would strongly recommend reading "A Tutorial Introduction to the ARM and POWER Relaxed Memory Models" for the hardware perspective.

5. Fun fact: IRIW is still broken with VarHandle getAcquire/setRelease. As in C/C++11, these operations do not require a visible total order of stores, which IRIW asserts.

6. Doug Lea: "There are contexts in which designing via fences can be more productive than per-variable. These can arise for example when all races among two variables can be tolerated, but they still require some ordering guarantees w.r.t. other variables. Note that these are also forms of "benign" races, but they are not benignly neglected. It takes a lot of thought to figure them out. (Or maybe someday use some still-experimental tools, see Fender)"

7. Fun fact: Sometimes you can replace volatile store/load with setRelease/getAcquire to squeeze a little more performance juice. This gives up sequential consistency, and therefore needs to be carefully analyzed for correctness. This footnote was paid for by the prophets of "Sequential Consistency or Death" school of thought.

8. Fun fact: The lack of any semantic need for SC among concurrent readers is the starting point for nearly all distributed-consistency models and protocols. This footnote was paid for by the prophets of "Sequential Consistency Is Death" school of thought.

9. Fun fact: Most hardware actually provides a more intuitive property, coherence: "All writes to one particular location are observed in one particular order" (this is weaker than a total order of all stores), which precludes 1, 0 outcome here. Arguably, that is the way humans like to think, but this is much stronger than the Java Memory Model. Some memory models, like C/C++11 Memory Model, describe the modification order over atomics that is similar to the definition above, and matching closely what hardware provides. In JDK 9+, VarHandles "opaque" mode has similar semantics, even though it is hard to specify in the realm of current JMM.

10. In the same way the limit on the speed of communication in Special Relativity gives rise to the relativity of simultaneity, the propagation delays on real hardware deconstruct the intuitive notion of simultaneity and global time. (I’m not the first one who made this connection)

11. This is so surprising that implementations on some platforms where this behavior is the norm (e.g. OpenJDK PPC port) chose to emit appropriate barriers when volatile stores happen in constructors, even though it is not required by the specification.

12. Doug Lea: "This final vs volatile issue has led to some twisty constructions in java.util.concurrent to allow 0 as the base/default value in cases where it would not naturally be. This rule sucks and should be changed."