Aleksey Shipilёv, @shipilev, aleksey@shipilev.net
THIS IS A WORK IN PROGRESS: although it appears logically consistent at this point, it is still not final and lacks proper peer review. Use with caution.
Preface
Java Memory Model overhaul raises all sorts of questions. To answer them, we sometimes need performance experiments to guide us through the pros and cons of a particular proposal. Of course, other things matter as well: code reliability, learning effort, simplicity of reasoning, etc. usually prevail. This post continues exploring some of the early questions.
As has become a good tradition, we will take some diversions into benchmarking methodology, so that even though the post itself is targeted at platform people, non-platform people can still learn a few tricks. As usual, if you still haven’t learned about JMH and/or haven’t looked through the JMH samples, then I suggest you do that first before reading the rest of this post, for the best experience.
Access Atomicity
Specification requirements
There are different notions of atomicity in the multi-threaded world. The best-known notion of atomicity belongs to "operations", e.g. whether a group of operations is observable in only two states: executed completely, or not yet executed. There are a few other "flavors" of atomicity, and we will focus on a particular one: access atomicity.
Access to a variable is atomic if the result of that access appears indivisible. This seems like an intuitive thing to most programmers, but it may be hard to implement on every target architecture, so you sometimes need to make a trade-off. The notorious example of a trade-off like this is the exception from the atomicity of long and double variables in the Java Memory Model. That is, this program:
long x = 0L;

void thread1() {
  x = -1L;
}

void thread2() {
  println(x);
}
…is allowed to print "0", "-1", or any other transient value. JMM provides an escape hatch from this behavior: setting the volatile modifier on the field regains atomicity. Therefore, this program is required to print either "0" or "-1":
volatile long x = 0L;

void thread1() {
  x = -1L;
}

void thread2() {
  println(x);
}
Implementation Specifics
In the absence of native read/write operations of the required width, the implementation has to use other available means to fit the language semantics. This may include using extended instruction sets to get atomic accesses (e.g. x87 FPU, SSE, VFP, or other vector instructions), resorting to CAS-ed writes, or even locking.
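As a sketch of the CAS-ed write strategy: an indivisible 64-bit store can be emulated with a compare-and-swap loop. This is illustrative Java, using AtomicLong as a stand-in for the hardware CAS instruction, not what the VM actually emits:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasStore {
    // Emulate an indivisible 64-bit store on hardware that offers
    // a 64-bit compare-and-swap, but no indivisible 64-bit plain store.
    static void atomicStore(AtomicLong cell, long newValue) {
        long current;
        do {
            current = cell.get();        // read the current full value
        } while (!cell.compareAndSet(current, newValue)); // swap it in whole
    }

    public static void main(String[] args) {
        AtomicLong cell = new AtomicLong(0L);
        atomicStore(cell, -1L);
        System.out.println(cell.get()); // prints -1
    }
}
```

Note the cost implication: every store now carries at least one atomic read-modify-write, which is why such substitutions are worth measuring.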
Since these substitutions are not required to provide the memory ordering guarantees, it seems convenient for the implementation to decouple emitting the access instruction sequences from the memory semantics (e.g. emitting the barriers). This makes volatile even more conflated: users can’t request access atomicity without getting the memory barriers as an add-on; and conversely, can’t eliminate the memory barriers around a volatile access without giving up access atomicity.
Hence, we get a qualitative performance model: a normal access should be faster than an access-atomic access, which in turn should be faster than a volatile access. Quantitative experiments should therefore answer two questions:
What is the cost difference between normal and access-atomic accesses? If this difference is small, then it seems prudent to enforce access atomicity for all types.
What is the cost difference between access-atomic and volatile accesses? If this difference is large, then it reinforces the idea of enforcing access atomicity.
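The cost questions above can be probed with a JMH benchmark along these lines. This is a simplified sketch (class and benchmark names are mine, and the actual benchmarks used for the experiments may differ) that contrasts plain and volatile long accesses; the plain accesses become access-atomic under the experimental VM flag introduced below:

```java
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class AccessAtomicityBench {
    long plain;        // plain access; access-atomic only with the experimental flag
    volatile long vol; // volatile access: atomicity plus memory barriers

    @Benchmark
    public long readPlain() {
        return plain;
    }

    @Benchmark
    public long readVolatile() {
        return vol;
    }

    @Benchmark
    public void writePlain() {
        plain = 42L;
    }

    @Benchmark
    public void writeVolatile() {
        vol = 42L;
    }
}
```

Running the same benchmark with and without the atomicity-enforcing flag would then separate the cost of atomicity from the cost of the barriers.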
Experimental Setup
Platforms
Since we are dealing with hardware support issues, we need a wide range of architectures and microarchitectures. This pushes us to run the tests on at least five interesting platforms:
x86 (Ivy Bridge): 2 socket, 12 core, 2 hyperthreaded, Xeon E5-2697v2, Ivy Bridge, 2.7 GHz
x86 (Atom): 1 socket, 1 core, 2 hyperthreaded, Atom Z530, 1.6 Ghz
ARMv6: 1 socket, 1 core, Broadcom BCM2835 SoC (Raspberry Pi), 0.7 GHz
ARMv7: 1 socket, 4 core, Exynos 4412 Prime (Odroid-U2), Cortex-A9, 1.7 GHz
POWERv6: 1 socket, 8 core, Freescale P4080, e500mc, 1.5 GHz
We are going to use the JDK 8 source tree as the baseline for the experiments. This includes the Java SE 8 Embedded ARM/PPC ports, which are not available in OpenJDK but can be downloaded from the Oracle website. Lucky for us, all the required changes are doable in the machine-independent parts available in OpenJDK.
Experimental VM changes
Luckily for us, the mechanism to provide double/long atomicity is already available in the current HotSpot VM, and that mechanism is already decoupled from the rest of the memory model machinery, i.e. the barriers. This simple experimental VM change was enough to gain the atomicity unconditionally in both the C2[1] and C1[2] HotSpot compilers (JDK-8033380):
--- old/src/share/vm/c1/c1_LIRGenerator.cpp 2014-02-11 21:29:45.730836748 +0400
+++ new/src/share/vm/c1/c1_LIRGenerator.cpp 2014-02-11 21:29:45.566836744 +0400
@@ -1734,7 +1734,8 @@
(info ? new CodeEmitInfo(info) : NULL));
}
- if (is_volatile && !needs_patching) {
+ bool needs_atomic_access = is_volatile || AlwaysAtomicAccesses;
+ if (needs_atomic_access && !needs_patching) {
volatile_field_store(value.result(), address, info);
} else {
LIR_PatchCode patch_code = needs_patching ? lir_patch_normal : lir_patch_none;
@@ -1807,7 +1808,8 @@
address = generate_address(object.result(), x->offset(), field_type);
}
- if (is_volatile && !needs_patching) {
+ bool needs_atomic_access = is_volatile || AlwaysAtomicAccesses;
+ if (needs_atomic_access && !needs_patching) {
volatile_field_load(address, reg, info);
} else {
LIR_PatchCode patch_code = needs_patching ? lir_patch_normal : lir_patch_none;
--- old/src/share/vm/c1/c1_Runtime1.cpp 2014-02-11 21:29:46.342836763 +0400
+++ new/src/share/vm/c1/c1_Runtime1.cpp 2014-02-11 21:29:46.178836759 +0400
@@ -809,11 +809,10 @@
int bci = vfst.bci();
Bytecodes::Code code = caller_method()->java_code_at(bci);
-#ifndef PRODUCT
// this is used by assertions in the access_field_patching_id
BasicType patch_field_type = T_ILLEGAL;
-#endif // PRODUCT
bool deoptimize_for_volatile = false;
+ bool deoptimize_for_atomic = false;
int patch_field_offset = -1;
KlassHandle init_klass(THREAD, NULL); // klass needed by load_klass_patching code
KlassHandle load_klass(THREAD, NULL); // klass needed by load_klass_patching code
@@ -839,11 +838,17 @@
// is the path for patching field offsets. load_klass is only
// used for patching references to oops which don't need special
// handling in the volatile case.
+
deoptimize_for_volatile = result.access_flags().is_volatile();
-#ifndef PRODUCT
+ // If we are patching a field which should be atomic, then
+ // the generated code is not correct either, force deoptimizing.
+ // We need to only cover T_LONG and T_DOUBLE fields, as we can
+ // break access atomicity only for them.
+
patch_field_type = result.field_type();
-#endif
+ deoptimize_for_atomic = (AlwaysAtomicAccesses && (patch_field_type == T_DOUBLE || patch_field_type == T_LONG));
+
} else if (load_klass_or_mirror_patch_id) {
Klass* k = NULL;
switch (code) {
@@ -918,13 +923,19 @@
ShouldNotReachHere();
}
- if (deoptimize_for_volatile) {
- // At compile time we assumed the field wasn't volatile but after
- // loading it turns out it was volatile so we have to throw the
+ if (deoptimize_for_volatile || deoptimize_for_atomic) {
+ // At compile time we assumed the field wasn't volatile/atomic but after
+ // loading it turns out it was volatile/atomic so we have to throw the
// compiled code out and let it be regenerated.
if (TracePatching) {
- tty->print_cr("Deoptimizing for patching volatile field reference");
+ if (deoptimize_for_volatile) {
+ tty->print_cr("Deoptimizing for patching volatile field reference");
+ }
+ if (deoptimize_for_atomic) {
+ tty->print_cr("Deoptimizing for patching atomic field reference");
+ }
}
+
// It's possible the nmethod was invalidated in the last
// safepoint, but if it's still alive then make it not_entrant.
nmethod* nm = CodeCache::find_nmethod(caller_frame.pc());
--- old/src/share/vm/opto/parse3.cpp 2014-02-11 21:29:46.890836776 +0400
+++ new/src/share/vm/opto/parse3.cpp 2014-02-11 21:29:46.734836772 +0400
@@ -233,7 +233,8 @@
// Build the load.
//
MemNode::MemOrd mo = is_vol ? MemNode::acquire : MemNode::unordered;
- Node* ld = make_load(NULL, adr, type, bt, adr_type, mo, is_vol);
+ bool needs_atomic_access = is_vol || AlwaysAtomicAccesses;
+ Node* ld = make_load(NULL, adr, type, bt, adr_type, mo, needs_atomic_access);
// Adjust Java stack
if (type2size[bt] == 1)
@@ -314,7 +315,8 @@
}
store = store_oop_to_object(control(), obj, adr, adr_type, val, field_type, bt, mo);
} else {
- store = store_to_memory(control(), adr, val, bt, adr_type, mo, is_vol);
+ bool needs_atomic_access = is_vol || AlwaysAtomicAccesses;
+ store = store_to_memory(control(), adr, val, bt, adr_type, mo, needs_atomic_access);
}
// If reference is volatile, prevent following volatiles ops from
--- old/src/share/vm/runtime/globals.hpp 2014-02-11 21:29:47.466836790 +0400
+++ new/src/share/vm/runtime/globals.hpp 2014-02-11 21:29:47.282836785 +0400
@@ -3859,6 +3859,9 @@
"Allocation less than this value will be allocated " \
"using malloc. Larger allocations will use mmap.") \
\
+ experimental(bool, AlwaysAtomicAccesses, false, \
+ "Accesses to all variables should always be atomic") \
+ \
product(bool, EnableTracing, false, \
"Enable event-based tracing") \
\
However, due to the specifics of instruction selection in both compilers, they may generate different code sequences even for plain reads and writes. We will quantify this more rigorously later. So far, this VM change allows us to enforce atomicity for all types with -XX:+UnlockExperimentalVMOptions -XX:+AlwaysAtomicAccesses.
Correctness Results
We use jcstress to validate that our changes indeed regain the atomicity. By the nature of functional testing, you can’t confirm the atomicity is regained, but you can estimate the probability of breaking it. jcstress has bundled tests for our line of testing:
$ java -jar jcstress.jar -t ".*atomicity.primitives.plain.(Long|Double).*" -jvmArgs "-server -XX:+UnlockExperimentalVMOptions -XX:+AlwaysAtomicAccesses" -v -time 30000 -iters 20
To make the runs quicker, we only target the long/double tests, and only their non-volatile variants.
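For reference, the bundled atomicity tests look roughly like this in the jcstress API. The annotation and result class names below are from jcstress itself, but this sketch is not a verbatim copy of the bundled test:

```java
import org.openjdk.jcstress.annotations.*;
import org.openjdk.jcstress.infra.results.J_Result;

import static org.openjdk.jcstress.annotations.Expect.*;

@JCStressTest
@Outcome(id = "0",  expect = ACCEPTABLE, desc = "Reading the default value.")
@Outcome(id = "-1", expect = ACCEPTABLE, desc = "Reading the written value.")
@Outcome(           expect = FORBIDDEN,  desc = "Torn read: atomicity is broken.")
@State
public class PlainLongAtomicity {
    long x; // deliberately non-volatile: atomicity hinges on the VM flag

    @Actor
    public void writer() {
        x = -1L;
    }

    @Actor
    public void reader(J_Result r) {
        r.r1 = x; // any value other than 0 or -1 is a torn read
    }
}
```

The harness runs both actors concurrently many times and tallies the observed outcomes, which is where the X/Y counts below come from.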
ARMv6 (32-bit)
Since this platform is uniprocessor, it seems useless to run the correctness tests there.
ARMv7 (32-bit)
There are four cores on our stock ARMv7, so concurrency testing becomes sensible. Below are the results we got on our builds. If you are not familiar with the tests, here’s a quick overview of how to read the results: we write X/Y to signify that X exceptional cases were observed over Y runs. In our case, the exceptional case is a breakage of atomicity. The tests are statistical, so you can only prove the implementation is not atomic; you cannot prove the implementation is always atomic.