Aleksey Shipilёv, @shipilev, aleksey@shipilev.net
Preface
Most formalisms in the Java Memory Model are usable only as a deep reference to explain what constitutes legal outcomes in Java programs. If you don’t believe me, go and read the "Java Memory Model Pragmatics" transcript again. Most people try to digest the JMM rules as stated, and then figure out the high-level constructions that are usable without causing developers' heads to explode.
Two of these constructions are "Safe Publication" and "Safe Initialization". We will take a look at Singletons to explain both constructions. It is important to read these singleton examples as a tutorial in Java concurrency, and not as a promotion for this particular design pattern. The resurgence of lazy val-like language constructs often forces us to go for the same pattern anyhow.
This post is a translation and update of a two-year-old post I wrote in Russian. Thanks to Gleb Smirnov, Nitsan Wakart, and Joe Kearney for reviewing content and grammar!
Theory
Singletons and Singleton Factories
It is mildly irritating when people confuse Singletons and Singleton Factories. For the sake of our discussion, we need to clearly separate these two notions. A Singleton is an object that has only a single instance at every point of the program life-cycle. A Singleton Factory is an object that maintains Singletons. Of course, you can conflate these two in a single implementation, but that is not the point of this post.
A reasonable SingletonFactory has a few properties:

1. It provides the public API for getting a Singleton instance.
2. It is thread-safe. No matter how many threads are requesting a Singleton, all threads will get the same Singleton instance, regardless of the current state.
3. It is lazy. One can argue about this, but non-lazy factories are not interesting for our discussion. Singleton initialization should happen with the first request for a Singleton, not when the Singleton class is initialized. If no one wants a Singleton instance, it should not be instantiated.
4. It is efficient. The overheads for managing the Singleton state should be kept at a minimum.
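These properties can be captured as a minimal contract. The following sketch is purely illustrative (the names are made up, not from any actual library); the eager variant satisfies properties (1), (2), and (4), but deliberately violates (3):

```java
class Singleton {
}

// A sketch of the SingletonFactory contract described above.
interface SingletonFactory {
    // Thread-safe and lazy: every caller gets the same Singleton instance,
    // constructed only on the first request.
    Singleton get();
}

class EagerFactory implements SingletonFactory {
    // Not lazy: the instance is created when EagerFactory is constructed,
    // even if nobody ever calls get().
    private final Singleton instance = new Singleton();

    @Override
    public Singleton get() {
        return instance;
    }
}
```

The eager variant is shown only to make the laziness property concrete; the rest of the post deals with factories that are lazy as well.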
One can be reasonably sure this is not a good SingletonFactory:
```java
public class SynchronizedCLFactory {
    private Singleton instance;

    public Singleton get() {
        synchronized (this) {
            if (instance == null) {
                instance = new Singleton();
            }
            return instance;
        }
    }
}
```
…as it has properties (1), (2), and (3), but lacks property (4), since we do synchronization on every call to `get()`.
This observation gives rise to the Double-Checked Locking idiom. It tries to evade synchronization if the `Singleton` is already initialized, which is the common case after the one-time synchronization. Understanding that, most people go ahead and write:
```java
public class UnsafeDCLFactory {
    private Singleton instance;

    public Singleton get() {
        if (instance == null) {          // read 1, check 1
            synchronized (this) {
                if (instance == null) {  // read 2, check 2
                    instance = new Singleton();
                }
            }
        }
        return instance;                 // read 3
    }
}
```
Alas, this construction does not work properly, for two reasons.

1. One could think that after "check 1" had succeeded, the `Singleton` instance is properly initialized, and we can return it. That is not correct: the `Singleton` contents are only fully visible to the constructing thread! There are no guarantees that you will see the `Singleton` contents properly in other threads, because you are racing against the initializer thread. Once again, even if you have observed the non-null `instance`, it does not mean you observe its internal state properly. In JMM-ese speak, there is no happens-before between the initializing stores in the `Singleton` constructor and the reads of `Singleton` fields.
2. Notice that we do several reads of `instance` in this code, and at least "read 1" and "read 3" are reads without any synchronization — that is, those reads are racy. One of the intents of the Java Memory Model is to allow reorderings for ordinary reads; otherwise, the performance costs would be prohibitive. Specification-wise, as mentioned in the happens-before consistency rules, a read action can observe the unordered write via the race. This is decided for each read action, regardless of what other actions have already read the same location. In our example, that means that even though "read 1" could read a non-null `instance`, the code then moves on to returning it, then it does another racy read, and it can read a null `instance`, which would be returned!
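To make the failure mode concrete, here is a hand-rolled sketch of such a racy factory together with a naive stress loop (all class and method names here are made up for illustration). Note that not observing a failure proves nothing: on strongly-ordered hardware, and without compiler fuzzing, the broken factory will usually appear to work.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class UnsafeDCLDemo {
    static class Singleton {
        Object payload = new Object(); // non-final: initialized unsafely
    }

    static class UnsafeDCLFactory {
        private Singleton instance; // deliberately non-volatile

        Singleton get() {
            if (instance == null) {              // racy read 1
                synchronized (this) {
                    if (instance == null) {      // read 2, under the lock
                        instance = new Singleton();
                    }
                }
            }
            return instance;                     // racy read 3
        }
    }

    // Hammer fresh factories from several threads, counting null or
    // under-initialized observations. Returns the failure count.
    static int stress(int iterations, int threads) {
        AtomicInteger failures = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int i = 0; i < iterations; i++) {
                UnsafeDCLFactory factory = new UnsafeDCLFactory();
                CountDownLatch start = new CountDownLatch(1);
                CountDownLatch done = new CountDownLatch(threads);
                for (int t = 0; t < threads; t++) {
                    pool.execute(() -> {
                        try {
                            start.await();
                            Singleton s = factory.get();
                            if (s == null || s.payload == null) {
                                failures.incrementAndGet();
                            }
                        } catch (InterruptedException ignored) {
                            // benign for this sketch
                        } finally {
                            done.countDown();
                        }
                    });
                }
                start.countDown();
                done.await();
            }
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("failures: " + stress(10_000, 4));
    }
}
```

jcstress, used later in this post, does this kind of probing far more rigorously.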
Safe Publication
Now we are going to describe the concept of safe publication. Safe publication differs from a regular publication in one crucial point: safe publication makes all the values written before the publication visible to all readers that observed the published object. It is a great simplification over the JMM rules of engagement with regards to actions, orders and such.
There are a few trivial ways to achieve safe publication:

- Exchange the reference through a properly locked field (JLS 17.4.5).
- Use a static initializer to do the initializing stores (JLS 12.4).
- Exchange the reference via a volatile field (JLS 17.4.5), or as a consequence of this rule, via the AtomicX classes.
- Initialize the value into a final field (JLS 17.5).
Let us try to exploit each of those ways. The most trivial example is publishing through a properly locked field:
```java
public class SynchronizedCLFactory {
    private Singleton instance;

    public Singleton get() {
        synchronized (this) {
            if (instance == null) {
                instance = new Singleton();
            }
            return instance;
        }
    }
}
```
This is the most trivial example spec-wise:

1. Mutual exclusion during `get()` calls allows only a single thread to do the `Singleton` initialization.
2. The lock acquisition and release yield the synchronization actions that are bound in synchronizes-with, and then in happens-before. This forces the threads acquiring the lock to see the result of all the actions that were done before the lock release.
The classic holder idiom does roughly the same, but piggy-backs on class initialization locks. It safely publishes the object by doing the initialization in the static initializer. Note this thing is lazy, since we do not initialize `Holder` until the first call to `get()`:
```java
public class HolderFactory {
    public static Singleton get() {
        return Holder.instance;
    }

    private static class Holder {
        public static final Singleton instance = new Singleton();
    }
}
```
This is how it works spec-wise:

1. Class initialization is done under the lock, as per JLS 12.4.2. The class initialization lock provides the mutual exclusion during class initialization, that is, only a single thread can initialize the static fields.
2. The release of the class initialization lock plays the necessary role in establishing the happens-before relationships between the actions in static initializers and any users of the static fields. Naively speaking, the propagation of memory effects requires any reader of a static field to acquire the class initialization lock first, but the JLS allows the VM to elide that locking if the memory effects are still maintained. Indeed, modern VMs do this optimization routinely.
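The laziness of the holder idiom is easy to observe directly. In this sketch (the `initialized` flag is an illustration aid, not part of the idiom), the nested class, and hence the singleton, is not initialized until the first `get()` call:

```java
class HolderLazinessDemo {
    static volatile boolean initialized = false; // illustration aid

    static class Singleton {
        Singleton() {
            initialized = true; // record when construction actually happens
        }
    }

    static class HolderFactory {
        static Singleton get() {
            return Holder.instance; // first use triggers Holder's <clinit>
        }

        private static class Holder {
            static final Singleton instance = new Singleton();
        }
    }

    public static void main(String[] args) {
        // Merely reaching this point does not initialize Holder:
        System.out.println("before get(): " + initialized); // prints "false"
        HolderFactory.get();
        System.out.println("after get():  " + initialized); // prints "true"
    }
}
```

This follows from JLS 12.4.1: a class is initialized only on its first active use, and referencing `HolderFactory` alone is not an active use of `Holder`.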
Now, let us get back to the infamous volatile DCL, which works because we now safely publish the singleton via `volatile`:
```java
public class SafeDCLFactory {
    private volatile Singleton instance;

    public Singleton get() {
        if (instance == null) {          // check 1
            synchronized (this) {
                if (instance == null) {  // check 2
                    instance = new Singleton();
                }
            }
        }
        return instance;
    }
}
```
How can this concept be backed up by the spec? Here’s how:

1. Volatile writes and volatile reads of `instance` yield the actions bound in synchronizes-with order, and therefore form happens-before. That means the actions preceding the volatile store (that is, the actions in constructors) precede any actions after reading the `instance`. In other words, those threads that called `get()` will observe a fully-constructed `Singleton`.
2. Volatile reads and writes of `instance` yield synchronization actions that are totally ordered, and consistent with program order. Therefore two consecutive reads of `instance` are guaranteed to see the same value, in the absence of an intervening write to `instance`.
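For completeness, safe publication via the AtomicX classes mentioned earlier can be sketched like this (the class name is hypothetical). Note that, unlike DCL, a lock-free CAS-based factory may construct an extra `Singleton` that loses the race and is discarded, which is acceptable only when a wasted instance is harmless:

```java
import java.util.concurrent.atomic.AtomicReference;

class AtomicCLFactory {
    static class Singleton {
    }

    private final AtomicReference<Singleton> ref = new AtomicReference<>();

    Singleton get() {
        Singleton s = ref.get(); // volatile-read semantics: safe consumption
        if (s == null) {
            s = new Singleton();
            // Publish only if nobody beat us to it; CAS has volatile-write
            // semantics, so the winning instance is safely published.
            if (!ref.compareAndSet(null, s)) {
                s = ref.get(); // somebody else won the race; use theirs
            }
        }
        return s;
    }
}
```

All callers still agree on a single instance, because only the CAS winner is ever returned to anyone.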
There is another way to provide the same guarantees: final fields. Since it is already too late to write to a final field outside of the constructor, we have to use a wrapper.
```java
public class FinalWrapperFactory {
    private FinalWrapper wrapper;

    public Singleton get() {
        FinalWrapper w = wrapper;
        if (w == null) {              // check 1
            synchronized (this) {
                w = wrapper;
                if (w == null) {      // check 2
                    w = new FinalWrapper(new Singleton());
                    wrapper = w;
                }
            }
        }
        return w.instance;
    }

    private static class FinalWrapper {
        public final Singleton instance;

        public FinalWrapper(Singleton instance) {
            this.instance = instance;
        }
    }
}
```
How does it map to the actual spec?

1. The constructor writing out the `final` field has a freeze action at the end. You can read more on final field semantics in "JMM Pragmatics". In short, if all the access chains contain the freeze action in between the initializing stores to `Singleton` fields and the read of any `Singleton` field, we are allowed to observe only the initializing stores. Note this means that if anyone had seen the `Singleton` instance before, then all bets are off: there is an access chain that bypasses the freeze action. In other words, it would be too late to "protect" an already published object.
2. Also notice we only do a single non-synchronized read of `wrapper`. Even though this read is racy, we recover from accidentally reading "null". If we were to read `wrapper` a second time before returning, that would set us up for an opportunity to read "null" again, and then return it.
For completeness, let us do a factory that still publishes via the race, but does only a single racy read, hopefully recovering after reading "null":
```java
public class UnsafeLocalDCLFactory implements Factory {
    private Singleton instance; // deliberately non-volatile

    @Override
    public Singleton getInstance() {
        Singleton res = instance;
        if (res == null) {
            synchronized (this) {
                res = instance;
                if (res == null) {
                    res = new Singleton();
                    instance = res;
                }
            }
        }
        return res;
    }
}
```
The introduction of the local variable here is a correctness fix, but only a partial one: there is still no happens-before between publishing the `Singleton` instance and reading any of its fields. We are only protecting ourselves from returning "null" instead of the `Singleton` instance. The same trick can also be regarded as a performance optimization for `SafeDCLFactory`, i.e. doing only a single volatile read, yielding:
```java
public class SafeLocalDCLFactory implements Factory {
    private volatile Singleton instance;

    @Override
    public Singleton getInstance() {
        Singleton res = instance;
        if (res == null) {
            synchronized (this) {
                res = instance;
                if (res == null) {
                    res = new Singleton();
                    instance = res;
                }
            }
        }
        return res;
    }
}
```
There, these are the basic idioms for publishing.
Safe Initialization
Moving on. Java allows us to declare an object in a manner that is always safe for publication, that is, it provides us with an opportunity for safe initialization. Safe initialization makes all the values initialized in the constructor visible to all readers that observed the object, regardless of whether the object was safely published or not. The Java Memory Model guarantees this if all the fields in the object are `final`, and there is no leakage of the under-initialized object from the constructor. This is an example of safe initialization:
```java
public class SafeSingleton implements Singleton {
    final Object obj1;
    final Object obj2;
    final Object obj3;
    final Object obj4;

    public SafeSingleton() {
        obj1 = new Object();
        obj2 = new Object();
        obj3 = new Object();
        obj4 = new Object();
    }
}
```
And this object would obviously be initialized unsafely:
```java
public final class UnsafeSingleton implements Singleton {
    Object obj1;
    Object obj2;
    Object obj3;
    Object obj4;

    public UnsafeSingleton() {
        obj1 = new Object();
        obj2 = new Object();
        obj3 = new Object();
        obj4 = new Object();
    }
}
```
You can read more on final field semantics in "All Fields Are Final". We will only highlight a few interesting implementation quirks related to final field handling. It is educational to read the relevant part of the VM implementation. In short, we emit a trailing barrier in three cases:
1. A final field was written. Notice we do not care which field was actually written; we unconditionally emit the barrier before exiting the (initializer) method. That means if you have at least one final field write, the final field semantics extend to every other field written in the constructor. That means this one can be a safe initialization, given this simple VM implementation:

   ```java
   public class TrickySingleton implements Singleton {
       final Object barrier;
       Object obj1;
       Object obj2;
       Object obj3;
       Object obj4;

       public TrickySingleton() {
           barrier = new Object();
           obj1 = new Object();
           obj2 = new Object();
           obj3 = new Object();
           obj4 = new Object();
       }
       ...
   }
   ```

   Notice how "fun" it would be for the actual software: you remove the "barrier" field in the course of some cleanup and/or refactoring, and an unrelated field starts to contain garbage after an unsafe publication.

2. A volatile field was written, on PPC. This is a courtesy of the PPC implementation, not required by the spec, strictly speaking. But without it, some very trivial scenarios, like unsafe publication of a user-initialized `AtomicInteger`, could see an under-initialized instance of it.

3. A field write was detected, and `-XX:+UnlockExperimentalVMOptions -XX:+AlwaysSafeConstructors` was requested. This is an experimental feature used to evaluate the correctness and performance impact of the pending JMM Update work.
Practice
Correctness
The discussion so far has been purely theoretical: if something can break, it does not mean it will break, right? Let us try to see the breakage in the wild. The original post was written well before we had jcstress. But now that we have it, we can just reuse the examples already available there.
The implementations in the jcstress suite vaguely represent the candidate `SingletonFactory` implementations we have already outlined above. We also have two `Singleton` implementations, one that is safely initialized, and another one that is initialized unsafely. Let’s try to run these singleton tests on two interesting platforms: high-end x86 and commodity ARMv7.

In the results below, the table cells show the number of failures per number of tries: a failure is either getting a null singleton, or an improperly constructed one. Both `Singleton` and `SingletonFactory` sources are linked in the table as well.
x86
Running the jcstress tests with

```
java -XX:+StressLCM -XX:+StressGCM -XX:-RestrictContended -jar target/jcstress.jar -t ".*singletons.*" -v -f 100
```

on a quad-core Haswell i7-4790K, Linux x86_64, JDK 8u40 fastdebug[1] build yields:
| (failures/tries) | Safe Singleton | Unsafe Singleton |
|---|---|---|
| FinalWrapperFactory | 0/3.3G | 0/3.2G |
| HolderFactory | 0/9.9G | 0/9.7G |
| SafeDCLFactory | 0/3.2G | 0/2.8G |
| SafeLocalDCLFactory | 0/3.2G | 0/3.3G |
| SynchronizedCLFactory | 0/3.8G | 0/3.7G |
| UnsafeDCLFactory | 0/3.7G | 43.8K/3.7G |
| UnsafeLocalDCLFactory | 0/3.89G | 20.7K/3.8G |
A few important notes here:

1. We can see that the failure happens when an unsafely constructed `Singleton` meets an unsafely publishing `SingletonFactory`. This is not advice for programming dangerously. Rather, this is an example of "defense in depth", where failing in one place does not break correctness for the entire program.
2. The tests are probabilistic: if we have not observed the failure, it does not mean the test would always pass. We might just be lucky to always have the illusion of correctness. This is certainly the case for Unsafe DCL + Safe Singleton. See the ARM example later for how it actually breaks.
3. x86 is Total Store Order hardware, meaning the stores are visible to all processors in some total order. That is, if the compiler actually presented the program stores in the same order to hardware, we may be reasonably sure the initializing stores of the instance fields would be visible before seeing the reference to the object itself. Even if your hardware is total-store-ordered, you can not be sure the compiler would not reorder within the allowed memory model spec. If you turn off `-XX:+StressGCM -XX:+StressLCM` in this experiment, all cases would appear correct, since the compiler did not reorder much.
ARM
A similar run with `java -XX:-RestrictContended -jar target/jcstress.jar -t ".*singletons.*" -v -f 10` on a quad-core Cortex-A9, Linux ARMv7, JDK 8:
| (failures/tries) | Safe Singleton | Unsafe Singleton |
|---|---|---|
| FinalWrapperFactory | 0/99M | 0/103M |
| HolderFactory | 0/352M | 0/358M |
| SafeDCLFactory | 0/101M | 0/105M |
| SafeLocalDCLFactory | 0/102M | 0/129M |
| SynchronizedCLFactory | 0/129M | 0/138M |
| UnsafeDCLFactory | 233/127M | 809/134M |
| UnsafeLocalDCLFactory | 0/126M | 230/134M |
A few important notes here:

1. Again, we see the failures when an unsafely initialized singleton meets an unsafely publishing singleton factory, as in the x86 case. Notice that we did not use any compiler fuzzers at this point: this is the result of actual hardware reordering, not compiler reordering.
2. The case of Unsafe DCL + Safe Singleton is interesting and stands out. The race in the Unsafe DCL factory is responsible for the breakage: we observe a null singleton. This is something that strikes people as odd, but it is legal in the JMM. This is another example that `final`-s can save you from observing an under-initialized object, but can not save you from not seeing the object itself!
3. Notice how doing `synchronized` in the Unsafe DCL store does not help, contrary to the layman belief that it somehow magically "flushes the caches" or whatnot. Without a paired lock when reading the protected state, you are not guaranteed to see the writes preceding the lock-protected write.
Performance
We can argue about correctness all we want, but given a few correct code variants, which one performs best? Let us answer this question with a few nano-benchmarks. Since we have JMH, and the `SingletonFactory` implementations are already available for us, we can just do this:
```java
@Fork(5)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class FinalWrapper {
    private FinalWrapperFactory factory;

    @Setup
    public void setup() {
        factory = new FinalWrapperFactory();
    }

    @Benchmark
    public Singleton first_unsafe() {
        return new FinalWrapperFactory().getInstance(SingletonUnsafe::new);
    }

    @Benchmark
    public Singleton first_safe() {
        return new FinalWrapperFactory().getInstance(SingletonSafe::new);
    }

    @Benchmark
    public Singleton steady_unsafe() {
        return factory.getInstance(SingletonUnsafe::new);
    }

    @Benchmark
    public Singleton steady_safe() {
        return factory.getInstance(SingletonSafe::new);
    }

    public static class FinalWrapperFactory {
        ...
    }
}
```
…and be done. Notice this benchmark covers two cases:

1. first_* cases, when we request the `Singleton` instance off a freshly-baked `SingletonFactory`. This may help to estimate the upfront costs of seeding the factory, which would also include the safe initialization cost, plus the "publishing" side of safe publication.
2. steady_* cases, when the `SingletonFactory` is already populated with the `Singleton` instance, and we arguably go through the internal fast-paths. This case is reasonable to measure in multiple threads (under contention), to see how the `SingletonFactory` performs under load. This will cover the "consuming" side of the safe publication costs.
x86
| (ns/op) | Safe Singleton | Unsafe Singleton |
|---|---|---|
| FinalWrapperFactory | 4.096 ± 0.018 | 4.091 ± 0.024 |
| HolderFactory | 2.256 ± 0.001 (not really) | 2.256 ± 0.001 (not really) |
| SafeDCLFactory | 6.684 ± 0.015 | 6.687 ± 0.015 |
| SafeLocalDCLFactory | 6.668 ± 0.011 | 6.994 ± 0.481 |
| SynchronizedCLFactory | 2.332 ± 0.009 | 2.339 ± 0.006 |
| UnsafeDCLFactory | 2.333 ± 0.007 | 2.335 ± 0.007 |
| UnsafeLocalDCLFactory | 2.335 ± 0.008 | 2.339 ± 0.006 |
How would one interpret these results? Arm yourself with low-level profilers, like JMH’s `-prof perfasm`, and you will see that:

1. Holder is lying to us: even though we instantiate a `HolderFactory`, the singleton is already initialized, and just sits there in a static field. The performance data for this idiom is unreliable.
2. FinalWrapper does one excess allocation of the wrapper itself, and that costs us a few CPU cycles, setting us back 2 ns per initialization.
3. Safe DCL does the volatile writes when storing the singleton instance, which entails a StoreLoad barrier. This sets us back 4 ns per initialization.
4. Synchronized CL has the synchronization removed, because the factory is thread-local. This is arguably a benchmarking quirk, but we will not do anything to address it, since the first invocation of the factory is not that important to us.
Of course, these costs are one-time, and in the grand scheme of things, these overheads are minuscule, probably drowning in all the initialization overheads incurred in other parts of an application.
The interesting case is about calling the singleton factory continuously:
| (ns/op) | 1 thread, Safe | 1 thread, Unsafe | 8 threads, Safe | 8 threads, Unsafe |
|---|---|---|---|---|
| FinalWrapperFactory | 2.256 ± 0.001 | 2.256 ± 0.001 | 2.481 ± 0.007 | 2.485 ± 0.015 |
| HolderFactory | 2.256 ± 0.001 | 2.257 ± 0.001 | 2.417 ± 0.005 | 2.415 ± 0.004 |
| SafeDCLFactory | 2.256 ± 0.001 | 2.256 ± 0.001 | 2.481 ± 0.011 | 2.475 ± 0.002 |
| SafeLocalDCLFactory | 2.257 ± 0.001 | 2.257 ± 0.001 | 2.496 ± 0.007 | 2.495 ± 0.003 |
| SynchronizedCLFactory | 18.860 ± 0.002 | 18.860 ± 0.001 | 311.652 ± 10.522 | 302.346 ± 14.347 |
| UnsafeDCLFactory | 2.257 ± 0.001 | 2.256 ± 0.001 | 2.488 ± 0.021 | 2.495 ± 0.006 |
| UnsafeLocalDCLFactory | 2.260 ± 0.002 | 2.257 ± 0.002 | 2.502 ± 0.019 | 2.497 ± 0.008 |
This is another interesting tidbit on x86: the Haswell processors are so fast, we mostly spend time in the JMH Blackhole consuming the Singleton object. Again, arming ourselves with `-prof perfasm` highlights what is happening:

1. Synchronized CL is very slow, since it does the acquire/release of the associated monitor on each `getInstance()` call. The cost is wild when multiple threads are going for a singleton instance, and this is why people invented double-checked locking to begin with.
2. The Holder factory is marginally faster, because once `getInstance()` is compiled, the class initialization check is gone — we know the `Holder` class is initialized — and so we load its field directly. The caveat is apparent here: we can not easily drop the Singleton instance from memory. Even if we lose all the references to the factory, the VM still needs to retain the class and its static fields.
3. All other factories are performing more or less the same. This is true even for Unsafe DCL, which is functionally broken. This is evidence that correctly synchronized code does not always work slower than "tricky" incorrect code.
ARM
ARMv7 is a weakly-ordered platform, and so the data gets more interesting.
| (ns/op) | Safe Singleton | Unsafe Singleton |
|---|---|---|
| FinalWrapperFactory | 169.986 ± 0.670 | 162.707 ± 0.556 |
| HolderFactory | 29.017 ± 0.081 | 29.075 ± 0.084 |
| SafeDCLFactory | 156.421 ± 0.275 | 153.458 ± 0.406 |
| SafeLocalDCLFactory | 154.130 ± 0.411 | 150.651 ± 0.332 |
| SynchronizedCLFactory | 128.344 ± 0.435 | 124.435 ± 0.384 |
| UnsafeDCLFactory | 128.515 ± 0.625 | 124.734 ± 0.300 |
| UnsafeLocalDCLFactory | 127.578 ± 0.307 | 124.968 ± 0.487 |
As usual, arming with `-prof perfasm`:[2]

1. Holder is still lying to us, see above: it does not re-initialize the singleton at all.
2. Final Wrapper is slower because of the excess allocations, as in the x86 case.
3. Safe DCL is slower because of the volatile writes, as in the x86 case.
4. The difference between Safe Singleton and Unsafe Singleton is the cost of safe initialization, i.e. the cost of the trailing barrier at the end of the constructor. It is not that great a cost, but it is measurable. By carefully choosing which safety guarantee to give up, one can speed up the entire thing: e.g. by using Unsafe Local DCL + Safe Singleton.
Again, these are the one-off costs, and we may choose not to care. We do care about the sustained performance though:
| (ns/op) | 1 thread, Safe | 1 thread, Unsafe | 4 threads, Safe | 4 threads, Unsafe |
|---|---|---|---|---|
| FinalWrapperFactory | 28.223 ± 0.005 | 28.228 ± 0.006 | 28.237 ± 0.001 | 28.237 ± 0.005 |
| HolderFactory | 13.526 ± 0.003 | 13.523 ± 0.002 | 13.529 ± 0.001 | 13.530 ± 0.001 |
| SafeDCLFactory | 33.510 ± 0.001 | 33.510 ± 0.001 | 33.530 ± 0.001 | 33.530 ± 0.001 |
| SafeLocalDCLFactory | 29.400 ± 0.005 | 29.397 ± 0.003 | 29.412 ± 0.001 | 29.412 ± 0.001 |
| SynchronizedCLFactory | 77.679 ± 0.176 | 77.560 ± 0.018 | 1237.716 ± 31.500 | 1291.585 ± 63.876 |
| UnsafeDCLFactory | 26.940 ± 0.130 | 26.957 ± 0.133 | 26.936 ± 0.119 | 26.940 ± 0.122 |
| UnsafeLocalDCLFactory | 30.573 ± 0.003 | 30.573 ± 0.003 | 30.588 ± 0.001 | 30.588 ± 0.001 |
Oh, I like that, it now starts to be quantifiable with the naked eye:

1. Synchronized CL behaves as expected: locks are not cheap, and once you contend on them, they do not get any cheaper.
2. The Holder factory leads the way big time: we do not do any volatile reads, we just reach for the static field, and return.
3. The Safe DCL factories do the volatile reads, which entail hardware memory barriers on ARMv7. One can estimate the impact of these barriers by looking at the difference between Safe DCL and Safe Local DCL — saving a volatile read helps performance a bit! While this should not be a concern for most programs, in some high-performance cases this might be a deal-breaker!
4. Unsafe DCL is even faster, because no volatile reads are happening at all. Remember, most tests involving Unsafe DCL are functionally broken. The 3-6 ns difference between Unsafe DCL and Safe DCL seems to be a fair price for correctness.
5. Final Wrapper does not do barriers, but it does an additional memory dereference going for the wrapper, and thus gains almost nothing compared to Safe DCL.
Conclusion
Safe Publication and Safe Initialization idioms are great abstractions over the Java Memory Model rules. Used carefully, these idioms can serve as building blocks for constructing larger concurrent programs without breaking correctness. The performance costs for these idioms are usually drowned in all other costs. In the singleton examples above, while the costs are measurable, the cost of allocation, or the cost of an additional memory dereference, may greatly offset the cost of the safe initialization/publication pattern itself.
Safe Initialization does not magically save you from all the races. If you are doing the racy reads of otherwise safely initialized objects, something can still go haywire. Racy reads themselves are almost pure evil, even on the platforms that are very forgiving, like x86.
And please, don’t you ever again come to me claiming DCL is broken, or that you will use "synchronized" in this construction because you are generally afraid of "volatiles" and can’t reason about them. That is where you will waste your precious nanoseconds, and many billions of your nanobucks.