Aleksey Shipilёv, @shipilev, aleksey@shipilev.net
Preface
Most formalisms in the Java Memory Model are usable only as a deep reference to explain what constitutes legal outcomes in Java programs. If you don’t believe me, go and read the "Java Memory Model Pragmatics" transcript again. Most people try to digest the JMM rules as stated, and then figure out the high-level constructions that are usable without causing developers' heads to explode.
Two of these constructions are "Safe Publication" and "Safe Initialization". We will take a look at Singletons to explain both constructions. It is important to read these singleton examples as a tutorial in Java concurrency, and not as a promotion for this particular design pattern. The resurgence of lazy val-like language constructs often forces us to go for the same pattern anyhow.
This post is a translation and update of a two-year-old post I wrote in Russian. Thanks to Gleb Smirnov, Nitsan Wakart, and Joe Kearney for reviewing content and grammar!
Theory
Singletons and Singleton Factories
It is mildly irritating when people confuse Singletons and Singleton Factories. For the sake of our discussion, we need to clearly separate these two notions. A Singleton is an object that has only a single instance at every point of the program life-cycle. A Singleton Factory is an object that maintains Singletons. Of course, you can conflate these two in a single implementation, but that is not the point of this post.
A reasonable SingletonFactory has a few properties:

1. It provides the public API for getting a Singleton instance.
2. It is thread-safe. No matter how many threads are requesting a Singleton, all threads will get the same Singleton instance, regardless of the current state.
3. It is lazy. One can argue about this, but non-lazy factories are not interesting for our discussion. Singleton initialization should happen with the first request for a Singleton, not when the Singleton class is initialized. If no one wants a Singleton instance, it should not be instantiated.
4. It is efficient. The overheads for managing the Singleton state should be kept at a minimum.
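These properties can be captured as a minimal contract. The following sketch is purely illustrative (the names are made up, not from any actual library); the eager variant satisfies properties (1), (2), and (4), but deliberately violates (3):

```java
class Singleton {
}

// A sketch of the SingletonFactory contract described above.
interface SingletonFactory {
    // Thread-safe and lazy: every caller gets the same Singleton instance,
    // constructed only on the first request.
    Singleton get();
}

class EagerFactory implements SingletonFactory {
    // Not lazy: the instance is created when EagerFactory is constructed,
    // even if nobody ever calls get().
    private final Singleton instance = new Singleton();

    @Override
    public Singleton get() {
        return instance;
    }
}
```

The eager variant is shown only to make the laziness property concrete; the rest of the post deals with factories that are lazy as well.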
One can be reasonably sure this is not a good SingletonFactory:
```java
public class SynchronizedCLFactory {
    private Singleton instance;

    public Singleton get() {
        synchronized (this) {
            if (instance == null) {
                instance = new Singleton();
            }
            return instance;
        }
    }
}
```
…as it has properties (1), (2), and (3), but lacks property (4), since we do synchronization on every call to `get()`.
This observation gives rise to the Double-Checked Locking idiom. It tries to evade synchronization if the `Singleton` is already initialized, which is the common case after the one-time synchronization. Understanding that, most people go ahead and write:
```java
public class UnsafeDCLFactory {
    private Singleton instance;

    public Singleton get() {
        if (instance == null) {          // read 1, check 1
            synchronized (this) {
                if (instance == null) {  // read 2, check 2
                    instance = new Singleton();
                }
            }
        }
        return instance;                 // read 3
    }
}
```
Alas, this construction does not work properly, for two reasons.

1. One could think that after "check 1" had succeeded, the `Singleton` instance is properly initialized, and we can return it. That is not correct: the `Singleton` contents are only fully visible to the constructing thread! There are no guarantees that you will see the `Singleton` contents properly in other threads, because you are racing against the initializer thread. Once again, even if you have observed the non-null `instance`, it does not mean you observe its internal state properly. In JMM-ese speak, there is no happens-before between the initializing stores in the `Singleton` constructor and the reads of `Singleton` fields.
2. Notice that we do several reads of `instance` in this code, and at least "read 1" and "read 3" are reads without any synchronization — that is, those reads are racy. One of the intents of the Java Memory Model is to allow reorderings for ordinary reads; otherwise, the performance costs would be prohibitive. Specification-wise, as mentioned in the happens-before consistency rules, a read action can observe the unordered write via the race. This is decided for each read action, regardless of what other actions have already read the same location. In our example, that means that even though "read 1" could read a non-null `instance`, the code then moves on to returning it, then it does another racy read, and it can read a null `instance`, which would be returned!
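To make the failure mode concrete, here is a hand-rolled sketch of such a racy factory together with a naive stress loop (all class and method names here are made up for illustration). Note that not observing a failure proves nothing: on strongly-ordered hardware, and without compiler fuzzing, the broken factory will usually appear to work.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class UnsafeDCLDemo {
    static class Singleton {
        Object payload = new Object(); // non-final: initialized unsafely
    }

    static class UnsafeDCLFactory {
        private Singleton instance; // deliberately non-volatile

        Singleton get() {
            if (instance == null) {              // racy read 1
                synchronized (this) {
                    if (instance == null) {      // read 2, under the lock
                        instance = new Singleton();
                    }
                }
            }
            return instance;                     // racy read 3
        }
    }

    // Hammer fresh factories from several threads, counting null or
    // under-initialized observations. Returns the failure count.
    static int stress(int iterations, int threads) {
        AtomicInteger failures = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int i = 0; i < iterations; i++) {
                UnsafeDCLFactory factory = new UnsafeDCLFactory();
                CountDownLatch start = new CountDownLatch(1);
                CountDownLatch done = new CountDownLatch(threads);
                for (int t = 0; t < threads; t++) {
                    pool.execute(() -> {
                        try {
                            start.await();
                            Singleton s = factory.get();
                            if (s == null || s.payload == null) {
                                failures.incrementAndGet();
                            }
                        } catch (InterruptedException ignored) {
                            // benign for this sketch
                        } finally {
                            done.countDown();
                        }
                    });
                }
                start.countDown();
                done.await();
            }
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("failures: " + stress(10_000, 4));
    }
}
```

jcstress, used later in this post, does this kind of probing far more rigorously.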
Safe Publication
Now we are going to describe the concept of safe publication. Safe publication differs from a regular publication in one crucial point: safe publication makes all the values written before the publication visible to all readers that observed the published object. It is a great simplification over the JMM rules of engagement with regards to actions, orders and such.
There are a few trivial ways to achieve safe publication:

- Exchange the reference through a properly locked field (JLS 17.4.5).
- Use a static initializer to do the initializing stores (JLS 12.4).
- Exchange the reference via a volatile field (JLS 17.4.5), or as a consequence of this rule, via the AtomicX classes.
- Initialize the value into a final field (JLS 17.5).
Let us try to exploit each of those ways. The most trivial example is publishing through a properly locked field:
```java
public class SynchronizedCLFactory {
    private Singleton instance;

    public Singleton get() {
        synchronized (this) {
            if (instance == null) {
                instance = new Singleton();
            }
            return instance;
        }
    }
}
```
This is the most trivial example spec-wise:

1. Mutual exclusion during `get()` calls allows only a single thread to do the `Singleton` initialization.
2. The lock acquisition and release yield the synchronization actions that are bound in synchronizes-with, and then in happens-before. This forces the threads acquiring the lock to see the result of all the actions that were done before the lock release.
The classic holder idiom does roughly the same, but piggy-backs on class initialization locks. It safely publishes the object by doing the initialization in the static initializer. Note this thing is lazy, since we do not initialize `Holder` until the first call to `get()`:
```java
public class HolderFactory {
    public static Singleton get() {
        return Holder.instance;
    }

    private static class Holder {
        public static final Singleton instance = new Singleton();
    }
}
```
This is how it works spec-wise:

1. Class initialization is done under the lock, as per JLS 12.4.2. The class initialization lock provides the mutual exclusion during class initialization, that is, only a single thread can initialize the static fields.
2. The release of the class initialization lock plays the necessary role in establishing the happens-before relationships between the actions in static initializers and any users of the static fields. Naively speaking, the propagation of memory effects requires any reader of a static field to acquire the class initialization lock first, but the JLS allows the VM to elide that locking if the memory effects are still maintained. Indeed, modern VMs do this optimization routinely.
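The laziness of the holder idiom is easy to observe directly. In this sketch (the `initialized` flag is an illustration aid, not part of the idiom), the nested class, and hence the singleton, is not initialized until the first `get()` call:

```java
class HolderLazinessDemo {
    static volatile boolean initialized = false; // illustration aid

    static class Singleton {
        Singleton() {
            initialized = true; // record when construction actually happens
        }
    }

    static class HolderFactory {
        static Singleton get() {
            return Holder.instance; // first use triggers Holder's <clinit>
        }

        private static class Holder {
            static final Singleton instance = new Singleton();
        }
    }

    public static void main(String[] args) {
        // Merely reaching this point does not initialize Holder:
        System.out.println("before get(): " + initialized); // prints "false"
        HolderFactory.get();
        System.out.println("after get():  " + initialized); // prints "true"
    }
}
```

This follows from JLS 12.4.1: a class is initialized only on its first active use, and referencing `HolderFactory` alone is not an active use of `Holder`.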
Now, let us get back to the infamous volatile DCL, which works because we now safely publish the singleton via `volatile`:
```java
public class SafeDCLFactory {
    private volatile Singleton instance;

    public Singleton get() {
        if (instance == null) {          // check 1
            synchronized (this) {
                if (instance == null) {  // check 2
                    instance = new Singleton();
                }
            }
        }
        return instance;
    }
}
```
How can this concept be backed up by the spec? Here’s how:

1. Volatile writes and volatile reads of `instance` yield the actions bound in synchronizes-with order, and therefore form happens-before. That means the actions preceding the volatile store (that is, the actions in constructors) precede any actions after reading the `instance`. In other words, those threads that called `get()` will observe a fully-constructed `Singleton`.
2. Volatile reads and writes of `instance` yield synchronization actions that are totally ordered, and consistent with program order. Therefore two consecutive reads of `instance` are guaranteed to see the same value, in the absence of an intervening write to `instance`.
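For completeness, safe publication via the AtomicX classes mentioned earlier can be sketched like this (the class name is hypothetical). Note that, unlike DCL, a lock-free CAS-based factory may construct an extra `Singleton` that loses the race and is discarded, which is acceptable only when a wasted instance is harmless:

```java
import java.util.concurrent.atomic.AtomicReference;

class AtomicCLFactory {
    static class Singleton {
    }

    private final AtomicReference<Singleton> ref = new AtomicReference<>();

    Singleton get() {
        Singleton s = ref.get(); // volatile-read semantics: safe consumption
        if (s == null) {
            s = new Singleton();
            // Publish only if nobody beat us to it; CAS has volatile-write
            // semantics, so the winning instance is safely published.
            if (!ref.compareAndSet(null, s)) {
                s = ref.get(); // somebody else won the race; use theirs
            }
        }
        return s;
    }
}
```

All callers still agree on a single instance, because only the CAS winner is ever returned to anyone.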
There is another way to provide the same guarantees: final fields. Since it is already too late to write to a final field outside of the constructor, we have to use a wrapper.
```java
public class FinalWrapperFactory {
    private FinalWrapper wrapper;

    public Singleton get() {
        FinalWrapper w = wrapper;
        if (w == null) {              // check 1
            synchronized (this) {
                w = wrapper;
                if (w == null) {      // check 2
                    w = new FinalWrapper(new Singleton());
                    wrapper = w;
                }
            }
        }
        return w.instance;
    }

    private static class FinalWrapper {
        public final Singleton instance;

        public FinalWrapper(Singleton instance) {
            this.instance = instance;
        }
    }
}
```
How does it map to the actual spec?

1. The constructor writing out the `final` field has a freeze action at the end. You can read more on final field semantics in "JMM Pragmatics". In short, if all the access chains contain the freeze action in between the initializing stores to `Singleton` fields and the read of any `Singleton` field, we are allowed to observe only the initializing stores. Note this means that if anyone had seen the `Singleton` instance before, then all bets are off: there is an access chain that bypasses the freeze action. In other words, it would be too late to "protect" an already published object.
2. Also notice we only do a single non-synchronized read of `wrapper`. Even though this read is racy, we recover from accidentally reading "null". If we were to read `wrapper` a second time before returning, that would set us up for an opportunity to read "null" again, and then return it.
For completeness, let us do a factory that still publishes via the race, but does only a single racy read, hopefully recovering after reading "null":
```java
public class UnsafeLocalDCLFactory implements Factory {
    private Singleton instance; // deliberately non-volatile

    @Override
    public Singleton getInstance() {
        Singleton res = instance;
        if (res == null) {
            synchronized (this) {
                res = instance;
                if (res == null) {
                    res = new Singleton();
                    instance = res;
                }
            }
        }
        return res;
    }
}
```
The introduction of the local variable here is a correctness fix, but only a partial one: there is still no happens-before between publishing the `Singleton` instance and reading any of its fields. We are only protecting ourselves from returning "null" instead of the `Singleton` instance. The same trick can also be regarded as a performance optimization for `SafeDCLFactory`, i.e. doing only a single volatile read, yielding:
```java
public class SafeLocalDCLFactory implements Factory {
    private volatile Singleton instance;

    @Override
    public Singleton getInstance() {
        Singleton res = instance;
        if (res == null) {
            synchronized (this) {
                res = instance;
                if (res == null) {
                    res = new Singleton();
                    instance = res;
                }
            }
        }
        return res;
    }
}
```
There, these are the basic idioms for publishing.
Safe Initialization
Moving on. Java allows us to declare an object in a manner that is always safe for publication, that is, it provides us with an opportunity for safe initialization. Safe initialization makes all the values initialized in the constructor visible to all readers that observed the object, regardless of whether the object was safely published or not. The Java Memory Model guarantees this if all the fields in the object are `final`, and there is no leakage of the under-initialized object from the constructor. This is an example of safe initialization:
```java
public class SafeSingleton implements Singleton {
    final Object obj1;
    final Object obj2;
    final Object obj3;
    final Object obj4;

    public SafeSingleton() {
        obj1 = new Object();
        obj2 = new Object();
        obj3 = new Object();
        obj4 = new Object();
    }
}
```
And this object would obviously be initialized unsafely:
```java
public final class UnsafeSingleton implements Singleton {
    Object obj1;
    Object obj2;
    Object obj3;
    Object obj4;

    public UnsafeSingleton() {
        obj1 = new Object();
        obj2 = new Object();
        obj3 = new Object();
        obj4 = new Object();
    }
}
```
You can read more on final field semantics in "All Fields Are Final". We will only highlight a few interesting implementation quirks related to final field handling. It is educational to read the relevant part of the VM implementation. In short, we emit a trailing barrier in three cases:
1. A final field was written. Notice we do not care which field was actually written; we unconditionally emit the barrier before exiting the (initializer) method. That means if you have at least one final field write, the final field semantics extend to every other field written in the constructor. That means this one can be a safe initialization, given this simple VM implementation:

   ```java
   public class TrickySingleton implements Singleton {
       final Object barrier;
       Object obj1;
       Object obj2;
       Object obj3;
       Object obj4;

       public TrickySingleton() {
           barrier = new Object();
           obj1 = new Object();
           obj2 = new Object();
           obj3 = new Object();
           obj4 = new Object();
       }
       ...
   }
   ```

   Notice how "fun" it would be for the actual software: you remove the "barrier" field in the course of some cleanup and/or refactoring, and an unrelated field starts to contain garbage after an unsafe publication.

2. A volatile field was written, on PPC. This is a courtesy of the PPC implementation, not required by the spec, strictly speaking. But without it, some very trivial scenarios, like unsafe publication of a user-initialized `AtomicInteger`, could see an under-initialized instance of it.

3. A field write was detected, and `-XX:+UnlockExperimentalVMOptions -XX:+AlwaysSafeConstructors` was requested. This is an experimental feature used to evaluate the correctness and performance impact of the pending JMM Update work.
Practice
Correctness
The discussion so far has been purely theoretical: if something can break, it does not mean it will break, right? Let us try to see the breakage in the wild. The original post was written well before we had jcstress. But now that we have it, we can just reuse the examples already available there.
The implementations in the jcstress suite vaguely represent the candidate `SingletonFactory` implementations we have already outlined above. We also have two `Singleton` implementations, one that is safely initialized, and another one that is initialized unsafely. Let’s try to run these singleton tests on two interesting platforms: high-end x86 and commodity ARMv7.

In the results below, the table cells show the number of failures per number of tries: a failure is either getting a null singleton, or an improperly constructed one. Both `Singleton` and `SingletonFactory` sources are linked in the table as well.
x86
Running the jcstress tests with

```
java -XX:+StressLCM -XX:+StressGCM -XX:-RestrictContended -jar target/jcstress.jar -t ".*singletons.*" -v -f 100
```

on a quad-core Haswell i7-4790K, Linux x86_64, JDK 8u40 fastdebug[1] build yields:
| (failures/tries) | Safe Singleton | Unsafe Singleton |
|---|---|---|
| FinalWrapperFactory | 0/3.3G | 0/3.2G |
| HolderFactory | 0/9.9G | 0/9.7G |
| SafeDCLFactory | 0/3.2G | 0/2.8G |
| SafeLocalDCLFactory | 0/3.2G | 0/3.3G |
| SynchronizedCLFactory | 0/3.8G | 0/3.7G |
| UnsafeDCLFactory | 0/3.7G | 43.8K/3.7G |
| UnsafeLocalDCLFactory | 0/3.89G | 20.7K/3.8G |
A few important notes here:

1. We can see that the failure happens when an unsafely constructed `Singleton` meets an unsafely publishing `SingletonFactory`. This is not advice for programming dangerously. Rather, this is an example of "defense in depth", where failing in one place does not break correctness for the entire program.
2. The tests are probabilistic: if we have not observed the failure, it does not mean the test would always pass. We might just be lucky to always have the illusion of correctness. This is certainly the case for Unsafe DCL + Safe Singleton. See the ARM example later for how it actually breaks.
3. x86 is Total Store Order hardware, meaning the stores are visible to all processors in some total order. That is, if the compiler actually presented the program stores in the same order to hardware, we may be reasonably sure the initializing stores of the instance fields would be visible before seeing the reference to the object itself. Even if your hardware is total-store-ordered, you can not be sure the compiler would not reorder within the allowed memory model spec. If you turn off `-XX:+StressGCM -XX:+StressLCM` in this experiment, all cases would appear correct, since the compiler did not reorder much.
ARM
A similar run with `java -XX:-RestrictContended -jar target/jcstress.jar -t ".*singletons.*" -v -f 10` on a quad-core Cortex-A9, Linux ARMv7, JDK 8:
| (failures/tries) | Safe Singleton | Unsafe Singleton |
|---|---|---|
| FinalWrapperFactory | 0/99M | 0/103M |
| HolderFactory | 0/352M | 0/358M |
| SafeDCLFactory | 0/101M | 0/105M |
| SafeLocalDCLFactory | 0/102M | 0/129M |
| SynchronizedCLFactory | 0/129M | 0/138M |
| UnsafeDCLFactory | 233/127M | 809/134M |
| UnsafeLocalDCLFactory | 0/126M | 230/134M |
A few important notes here:

1. Again, we see the failures when an unsafely initialized singleton meets an unsafely publishing singleton factory, as in the x86 case. Notice that we did not use any compiler fuzzers at this point: this is the result of actual hardware reordering, not compiler reordering.
2. The case of Unsafe DCL + Safe Singleton is interesting and stands out. The race in the Unsafe DCL factory is responsible for the breakage: we observe a null singleton. This is something that strikes people as odd, but it is legal in the JMM. This is another example that `final`-s can save you from observing an under-initialized object, but can not save you from not seeing the object itself!
3. Notice how doing `synchronized` in the Unsafe DCL store does not help, contrary to the layman belief that it somehow magically "flushes the caches" or whatnot. Without a paired lock when reading the protected state, you are not guaranteed to see the writes preceding the lock-protected write.
Performance
We can argue about correctness all we want, but given a few correct code variants, which one performs best? Let us answer this question with a few nano-benchmarks. Since we have JMH, and the `SingletonFactory` implementations are already available for us, we can just do this:
```java
@Fork(5)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class FinalWrapper {
    private FinalWrapperFactory factory;

    @Setup
    public void setup() {
        factory = new FinalWrapperFactory();
    }

    @Benchmark
    public Singleton first_unsafe() {
        return new FinalWrapperFactory().getInstance(SingletonUnsafe::new);
    }

    @Benchmark
    public Singleton first_safe() {
        return new FinalWrapperFactory().getInstance(SingletonSafe::new);
    }

    @Benchmark
    public Singleton steady_unsafe() {
        return factory.getInstance(SingletonUnsafe::new);
    }

    @Benchmark
    public Singleton steady_safe() {
        return factory.getInstance(SingletonSafe::new);
    }

    public static class FinalWrapperFactory {
        ...
    }
}
```
…and be done. Notice this benchmark covers two cases:

1. first_* cases, when we request the `Singleton` instance off a freshly-baked `SingletonFactory`. This may help to estimate the upfront costs of seeding the factory, which would also include the safe initialization cost, plus the "publishing" side of safe publication.
2. steady_* cases, when the `SingletonFactory` is already populated with the `Singleton` instance, and we arguably go through the internal fast-paths. This case is reasonable to measure in multiple threads (under contention), to see how the `SingletonFactory` performs under load. This will cover the "consuming" side of the safe publication costs.
x86
| (ns/op) | Safe Singleton | Unsafe Singleton |
|---|---|---|
| FinalWrapperFactory | 4.096 ± 0.018 | 4.091 ± 0.024 |
| HolderFactory | 2.256 ± 0.001 (not really) | 2.256 ± 0.001 (not really) |
| SafeDCLFactory | 6.684 ± 0.015 | 6.687 ± 0.015 |
| SafeLocalDCLFactory | 6.668 ± 0.011 | 6.994 ± 0.481 |
| SynchronizedCLFactory | 2.332 ± 0.009 | 2.339 ± 0.006 |
| UnsafeDCLFactory | 2.333 ± 0.007 | 2.335 ± 0.007 |
| UnsafeLocalDCLFactory | 2.335 ± 0.008 | 2.339 ± 0.006 |
How would one interpret these results? Arm yourself with low-level profilers, like JMH’s `-prof perfasm`, and you will see that:

1. Holder is lying to us: even though we instantiate a `HolderFactory`, the singleton is already initialized, and just sits there in a static field. The performance data for this idiom is unreliable.
2. FinalWrapper does one excess allocation of the wrapper itself, and that costs us a few CPU cycles, setting us back 2 ns per initialization.
3. Safe DCL does the volatile writes when storing the singleton instance, which entails a StoreLoad barrier. This sets us back 4 ns per initialization.
4. Synchronized CL has the synchronization removed, because the factory is thread-local. This is arguably a benchmarking quirk, but we will not do anything to address it, since the first invocation of the factory is not that important to us.
Of course, these costs are one-time, and in the grand scheme of things, these overheads are minuscule, probably drowning in all the initialization overheads incurred in other parts of an application.
The interesting case is about calling the singleton factory continuously:
| (ns/op) | 1 thread, Safe | 1 thread, Unsafe | 8 threads, Safe | 8 threads, Unsafe |
|---|---|---|---|---|
| FinalWrapperFactory | 2.256 ± 0.001 | 2.256 ± 0.001 | 2.481 ± 0.007 | 2.485 ± 0.015 |
| HolderFactory | 2.256 ± 0.001 | 2.257 ± 0.001 | 2.417 ± 0.005 | 2.415 ± 0.004 |
| SafeDCLFactory | 2.256 ± 0.001 | 2.256 ± 0.001 | 2.481 ± 0.011 | 2.475 ± 0.002 |
| SafeLocalDCLFactory | 2.257 ± 0.001 | 2.257 ± 0.001 | 2.496 ± 0.007 | 2.495 ± 0.003 |
| SynchronizedCLFactory | 18.860 ± 0.002 | 18.860 ± 0.001 | 311.652 ± 10.522 | 302.346 ± 14.347 |
| UnsafeDCLFactory | 2.257 ± 0.001 | 2.256 ± 0.001 | 2.488 ± 0.021 | 2.495 ± 0.006 |
| UnsafeLocalDCLFactory | 2.260 ± 0.002 | 2.257 ± 0.002 | 2.502 ± 0.019 | 2.497 ± 0.008 |
This is another interesting tidbit on x86: the Haswell processors are so fast, we mostly spend time in the JMH Blackhole consuming the Singleton object. Again, arming ourselves with `-prof perfasm` highlights what is happening:

1. Synchronized CL is very slow, since it does the acquire/release of the associated monitor on each `getInstance()` call. The cost is wild when multiple threads are going for a singleton instance, and this is why people invented double-checked locking to begin with.
2. The Holder factory is marginally faster, because once `getInstance()` is compiled, the class initialization check is gone — we know the `Holder` class is initialized — and so we load its field directly. The caveat is apparent here: we can not easily drop the Singleton instance from memory. Even if we lose all the references to the factory, the VM still needs to retain the class and its static fields.
3. All other factories are performing more or less the same. This is true even for Unsafe DCL, which is functionally broken. This is evidence that correctly synchronized code does not always work slower than "tricky" incorrect code.
ARM
ARMv7 is a weakly-ordered platform, and so the data gets more interesting.
| (ns/op) | Safe Singleton | Unsafe Singleton |
|---|---|---|
| FinalWrapperFactory | 169.986 ± 0.670 | 162.707 ± 0.556 |
| HolderFactory | 29.017 ± 0.081 | 29.075 ± 0.084 |
| SafeDCLFactory | 156.421 ± 0.275 | 153.458 ± 0.406 |
| SafeLocalDCLFactory | 154.130 ± 0.411 | 150.651 ± 0.332 |
| SynchronizedCLFactory | 128.344 ± 0.435 | 124.435 ± 0.384 |
| UnsafeDCLFactory | 128.515 ± 0.625 | 124.734 ± 0.300 |
| UnsafeLocalDCLFactory | 127.578 ± 0.307 | 124.968 ± 0.487 |
As usual, arming with `-prof perfasm`:[2]

1. Holder is still lying to us, see above: it does not re-initialize the singleton at all.
2. Final Wrapper is slower because of the excess allocations, as in the x86 case.
3. Safe DCL is slower because of the volatile writes, as in the x86 case.
4. The difference between Safe Singleton and Unsafe Singleton is the cost of safe initialization, i.e. the cost of the trailing barrier at the end of the constructor. It is not that great a cost, but it is measurable. By carefully choosing which safety guarantee to give up, one can speed up the entire thing: e.g. by using Unsafe Local DCL + Safe Singleton.
Again, these are the one-off costs, and we may choose not to care. We do care about the sustained performance though:
| (ns/op) | 1 thread, Safe | 1 thread, Unsafe | 4 threads, Safe | 4 threads, Unsafe |
|---|---|---|---|---|
| FinalWrapperFactory | 28.223 ± 0.005 | 28.228 ± 0.006 | 28.237 ± 0.001 | 28.237 ± 0.005 |
| HolderFactory | 13.526 ± 0.003 | 13.523 ± 0.002 | 13.529 ± 0.001 | 13.530 ± 0.001 |
| SafeDCLFactory | 33.510 ± 0.001 | 33.510 ± 0.001 | 33.530 ± 0.001 | 33.530 ± 0.001 |
| SafeLocalDCLFactory | 29.400 ± 0.005 | 29.397 ± 0.003 | 29.412 ± 0.001 | 29.412 ± 0.001 |
| SynchronizedCLFactory | 77.679 ± 0.176 | 77.560 ± 0.018 | 1237.716 ± 31.500 | 1291.585 ± 63.876 |
| UnsafeDCLFactory | 26.940 ± 0.130 | 26.957 ± 0.133 | 26.936 ± 0.119 | 26.940 ± 0.122 |
| UnsafeLocalDCLFactory | 30.573 ± 0.003 | 30.573 ± 0.003 | 30.588 ± 0.001 | 30.588 ± 0.001 |
Oh, I like that, it now starts to be quantifiable with the naked eye:

1. Synchronized CL behaves as expected: locks are not cheap, and once you contend on them, they do not get any cheaper.
2. The Holder factory leads the way big time: we do not do any volatile reads, we just reach for the static field, and return.
3. The Safe DCL factories do the volatile reads, which entail hardware memory barriers on ARMv7. One can estimate the impact of these barriers by looking at the difference between Safe DCL and Safe Local DCL — saving a volatile read helps performance a bit! While this should not be a concern for most programs, in some high-performance cases this might be a deal-breaker!
4. Unsafe DCL is even faster, because no volatile reads are happening at all. Remember, most tests involving Unsafe DCL are functionally broken. The 3-6 ns difference between Unsafe DCL and Safe DCL seems to be a fair price for correctness.
5. Final Wrapper does not do barriers, but it does an additional memory dereference going for the wrapper, and thus gains almost nothing compared to Safe DCL.
Conclusion
Safe Publication and Safe Initialization idioms are great abstractions over the Java Memory Model rules. Used carefully, these idioms can serve as building blocks for constructing larger concurrent programs without breaking correctness. The performance costs for these idioms are usually drowned in all other costs. In the singleton examples above, while the costs are measurable, the cost of allocation, or the cost of an additional memory dereference, may greatly offset the cost of the safe initialization/publication pattern itself.
Safe Initialization does not magically save you from all the races. If you are doing the racy reads of otherwise safely initialized objects, something can still go haywire. Racy reads themselves are almost pure evil, even on the platforms that are very forgiving, like x86.
And please, don’t you ever again come to me claiming DCL is broken, or that you will use "synchronized" in this construction because you are generally afraid of "volatiles" and can’t reason about them. That is where you will waste your precious nanoseconds, and many billions of your nanobucks.