Aleksey Shipilёv, @shipilev, aleksey@shipilev.net

Note
This post is also available in ePUB and mobi.

THIS IS A WORK IN PROGRESS: although it appears logically consistent at this point, it is not final yet and lacks proper peer review. Use with caution.

Preface

The Java Memory Model overhaul raises all sorts of questions. To answer them, we sometimes need performance experiments to guide us through the pros and cons of a particular proposal. Of course, other things matter as well: code reliability, learning effort, reasoning simplicity, etc. usually prevail. This post continues exploring some of the early questions.

As is good tradition, we will take some diversions into benchmarking methodology, so that even though the post itself is targeted at platform people, non-platform people can still learn a few tricks. As usual, if you still haven’t learned about JMH and/or haven’t looked through the JMH samples, I suggest you do that before reading the rest of this post, for the best experience.

Access Atomicity

Specification requirements

There are different notions of atomicity in the multi-threaded world. The best-known notion of atomicity belongs to "operations", e.g. whether a group of operations is observable in only two states: executed completely, or not executed at all. There are a few other "flavors" of atomicity, and we will focus on a particular one: access atomicity.

Access to a variable is atomic if the result of that access appears indivisible. This seems like an intuitive thing to most programmers, but it may be hard to implement on every target architecture, so sometimes a trade-off is needed. The notorious example of such a trade-off is the exemption from atomicity for long and double variables in the Java Memory Model. That is, this program:

 long x = 0L;
 void thread1() {
    x = -1L;
 }
 void thread2() {
    println(x);
 }

…​is allowed to print "0", "-1", or any other transient value. The JMM provides an escape hatch from this behavior: users can put the volatile modifier on the field, which regains atomicity. Therefore, this program is required to print either "0" or "-1":

 volatile long x = 0L;
 void thread1() {
    x = -1L;
 }
 void thread2() {
    println(x);
 }

Implementation Specifics

In the absence of native read/write operations of required widths, the implementation has to use other available means to fit the language semantics. This may include using the extended instruction sets to get the atomic accesses (e.g. x87 FPU, SSE, VFP, or other vector instructions), resorting to CAS-ed writes, or even locking.
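For intuition, here is a library-level sketch of two such fallback strategies in plain Java. The class and method names are made up for illustration; a real VM does this in generated code rather than in library classes:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: emulating atomic 64-bit accesses when the hardware
// has no native 64-bit loads/stores.
class LockBackedLong {
    private long value;                          // guarded by the intrinsic lock
    synchronized long read()        { return value; }
    synchronized void write(long v) { value = v; }
}

class CasBackedLong {
    // AtomicLong guarantees atomic 64-bit access on all platforms,
    // internally using CAS/locked instructions where needed.
    private final AtomicLong value = new AtomicLong();
    long read()        { return value.get(); }
    void write(long v) { value.set(v); }
}
```

Both sketches pay extra for every access; the locking variant also serializes unrelated reads, which is why implementations prefer wide instructions or CAS where available.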

Since these substitutions are not required to provide memory ordering guarantees, it seems convenient for the implementation to decouple emitting the access instruction sequences from emitting the memory semantics (e.g. the barriers). This makes the overloading of volatile all the more conflated: users can’t request access atomicity without getting the memory barriers as an add-on; and conversely, can’t eliminate the memory barriers around a volatile access without giving up the access atomicity.
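As an aside that post-dates this discussion: JDK 9+ eventually exposed exactly this decoupling to users via VarHandle "opaque" accesses, which promise access atomicity for long/double without the volatile-style ordering. A minimal sketch (class and field names are illustrative):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Opaque accesses: atomic (no word tearing), but without the memory
// ordering guarantees that volatile accesses carry.
class OpaqueLong {
    long x; // plain 64-bit field, accessed through the VarHandle

    static final VarHandle X;
    static {
        try {
            X = MethodHandles.lookup().findVarHandle(OpaqueLong.class, "x", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    long read()        { return (long) X.getOpaque(this); }
    void write(long v) { X.setOpaque(this, v); }
}
```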

Hence, we have a qualitative performance model: a normal access should be faster than an access-atomic access, which in turn should be faster than a volatile access. Quantitative experiments should therefore answer two questions:

  1. What is the cost difference of normal and access-atomic accesses? If this difference is small, then it seems prudent to enforce the access atomicity for all types.

  2. What is the cost difference between access-atomic and volatile accesses? If this difference is large, then it seems to reinforce the idea to enforce the access atomicity.

Experimental Setup

Platforms

Since we are dealing with hardware support issues, we need a wide range of architectures and microarchitectures. This pushes us to run the tests on at least five interesting platforms:

  • x86 (Ivy Bridge): 2 socket, 12 core, 2 hyperthreaded, Xeon E5-2697v2, Ivy Bridge, 2.7 GHz

  • x86 (Atom): 1 socket, 1 core, 2 hyperthreaded, Atom Z530, 1.6 Ghz

  • ARMv6: 1 socket, 1 core, Broadcom BCM2835 SoC (Raspberry Pi), 0.7 GHz

  • ARMv7: 1 socket, 4 core, Exynos 4412 Prime (Odroid-U2), Cortex-A9, 1.7 GHz

  • PowerPC (e500mc): 1 socket, 8 core, Freescale P4080, 1.5 GHz

We are going to use the JDK 8 source tree as the baseline for the experiments. This includes the Java SE 8 Embedded ARM/PPC ports, which are not available in OpenJDK but can be downloaded from the Oracle website. Luckily for us, all the required changes are doable in the machine-independent parts available in OpenJDK.

Experimental VM changes

Luckily for us, the mechanism providing double/long atomicity is already available in the current HotSpot VM, and that mechanism is already decoupled from the other memory model mechanics, i.e. barriers. This simple experimental VM change was enough to gain atomicity unconditionally in both the C2[1] and C1[2] HotSpot compilers (JDK-8033380):

--- old/src/share/vm/c1/c1_LIRGenerator.cpp	2014-02-11 21:29:45.730836748 +0400
+++ new/src/share/vm/c1/c1_LIRGenerator.cpp	2014-02-11 21:29:45.566836744 +0400
@@ -1734,7 +1734,8 @@
                 (info ? new CodeEmitInfo(info) : NULL));
   }

-  if (is_volatile && !needs_patching) {
+  bool needs_atomic_access = is_volatile || AlwaysAtomicAccesses;
+  if (needs_atomic_access && !needs_patching) {
     volatile_field_store(value.result(), address, info);
   } else {
     LIR_PatchCode patch_code = needs_patching ? lir_patch_normal : lir_patch_none;
@@ -1807,7 +1808,8 @@
     address = generate_address(object.result(), x->offset(), field_type);
   }

-  if (is_volatile && !needs_patching) {
+  bool needs_atomic_access = is_volatile || AlwaysAtomicAccesses;
+  if (needs_atomic_access && !needs_patching) {
     volatile_field_load(address, reg, info);
   } else {
     LIR_PatchCode patch_code = needs_patching ? lir_patch_normal : lir_patch_none;
--- old/src/share/vm/c1/c1_Runtime1.cpp	2014-02-11 21:29:46.342836763 +0400
+++ new/src/share/vm/c1/c1_Runtime1.cpp	2014-02-11 21:29:46.178836759 +0400
@@ -809,11 +809,10 @@
   int bci = vfst.bci();
   Bytecodes::Code code = caller_method()->java_code_at(bci);

-#ifndef PRODUCT
   // this is used by assertions in the access_field_patching_id
   BasicType patch_field_type = T_ILLEGAL;
-#endif // PRODUCT
   bool deoptimize_for_volatile = false;
+  bool deoptimize_for_atomic = false;
   int patch_field_offset = -1;
   KlassHandle init_klass(THREAD, NULL); // klass needed by load_klass_patching code
   KlassHandle load_klass(THREAD, NULL); // klass needed by load_klass_patching code
@@ -839,11 +838,17 @@
     // is the path for patching field offsets.  load_klass is only
     // used for patching references to oops which don't need special
     // handling in the volatile case.
+
     deoptimize_for_volatile = result.access_flags().is_volatile();

-#ifndef PRODUCT
+    // If we are patching a field which should be atomic, then
+    // the generated code is not correct either, force deoptimizing.
+    // We need to only cover T_LONG and T_DOUBLE fields, as we can
+    // break access atomicity only for them.
+
     patch_field_type = result.field_type();
-#endif
+    deoptimize_for_atomic = (AlwaysAtomicAccesses && (patch_field_type == T_DOUBLE || patch_field_type == T_LONG));
+
   } else if (load_klass_or_mirror_patch_id) {
     Klass* k = NULL;
     switch (code) {
@@ -918,13 +923,19 @@
     ShouldNotReachHere();
   }

-  if (deoptimize_for_volatile) {
-    // At compile time we assumed the field wasn't volatile but after
-    // loading it turns out it was volatile so we have to throw the
+  if (deoptimize_for_volatile || deoptimize_for_atomic) {
+    // At compile time we assumed the field wasn't volatile/atomic but after
+    // loading it turns out it was volatile/atomic so we have to throw the
     // compiled code out and let it be regenerated.
     if (TracePatching) {
-      tty->print_cr("Deoptimizing for patching volatile field reference");
+      if (deoptimize_for_volatile) {
+        tty->print_cr("Deoptimizing for patching volatile field reference");
+      }
+      if (deoptimize_for_atomic) {
+        tty->print_cr("Deoptimizing for patching atomic field reference");
+      }
     }
+
     // It's possible the nmethod was invalidated in the last
     // safepoint, but if it's still alive then make it not_entrant.
     nmethod* nm = CodeCache::find_nmethod(caller_frame.pc());
--- old/src/share/vm/opto/parse3.cpp	2014-02-11 21:29:46.890836776 +0400
+++ new/src/share/vm/opto/parse3.cpp	2014-02-11 21:29:46.734836772 +0400
@@ -233,7 +233,8 @@
   // Build the load.
   //
   MemNode::MemOrd mo = is_vol ? MemNode::acquire : MemNode::unordered;
-  Node* ld = make_load(NULL, adr, type, bt, adr_type, mo, is_vol);
+  bool needs_atomic_access = is_vol || AlwaysAtomicAccesses;
+  Node* ld = make_load(NULL, adr, type, bt, adr_type, mo, needs_atomic_access);

   // Adjust Java stack
   if (type2size[bt] == 1)
@@ -314,7 +315,8 @@
     }
     store = store_oop_to_object(control(), obj, adr, adr_type, val, field_type, bt, mo);
   } else {
-    store = store_to_memory(control(), adr, val, bt, adr_type, mo, is_vol);
+    bool needs_atomic_access = is_vol || AlwaysAtomicAccesses;
+    store = store_to_memory(control(), adr, val, bt, adr_type, mo, needs_atomic_access);
   }

   // If reference is volatile, prevent following volatiles ops from
--- old/src/share/vm/runtime/globals.hpp	2014-02-11 21:29:47.466836790 +0400
+++ new/src/share/vm/runtime/globals.hpp	2014-02-11 21:29:47.282836785 +0400
@@ -3859,6 +3859,9 @@
           "Allocation less than this value will be allocated "              \
           "using malloc. Larger allocations will use mmap.")                \
                                                                             \
+  experimental(bool, AlwaysAtomicAccesses, false,                           \
+          "Accesses to all variables should always be atomic")              \
+                                                                            \
   product(bool, EnableTracing, false,                                       \
           "Enable event-based tracing")                                     \
                                                                             \

However, due to the specifics of instruction selection in both compilers, this may produce different code sequences even for plain reads and writes. We will quantify this more rigorously below. So far, this VM change allows us to enforce atomicity for all types with -XX:+UnlockExperimentalVMOptions -XX:+AlwaysAtomicAccesses.

Correctness Results

We use jcstress to validate whether our changes indeed regain atomicity. By the nature of functional testing, you can’t confirm atomicity is regained, but you can estimate the probability of breaking it. jcstress has bundled tests for our line of testing:

$ java -jar jcstress.jar -t ".*atomicity.primitives.plain.(Long|Double).*"  -jvmArgs "-server -XX:+UnlockExperimentalVMOptions -XX:+AlwaysAtomicAccesses" -v -time 30000 -iters 20

To make the runs quicker, we only target long/double tests, and only their non-volatile variants.
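For intuition only, a hand-rolled (and far less rigorous) version of such a tearing test might look like the sketch below; jcstress arranges the thread rendezvous and collects the statistics far more rigorously:

```java
// Hand-rolled tearing probe: a thread stores -1L into a plain long while we
// sample it; any observed value other than 0 or -1 is a torn read.
class TearingProbe {
    long x; // plain, non-volatile 64-bit field

    long probe(int iterations) {
        long torn = 0;
        for (int i = 0; i < iterations; i++) {
            x = 0L;
            Thread writer = new Thread(() -> x = -1L);
            writer.start();
            long seen = x;                          // racy read, may overlap the write
            if (seen != 0L && seen != -1L) torn++;  // neither old nor new value: torn
            try {
                writer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return torn;
    }
}
```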

ARMv6 (32-bit)

Since this platform is a uniprocessor, it seems useless to run the correctness tests there.

ARMv7 (32-bit)

There are four cores on our stock ARMv7, and so concurrency testing becomes sensible. Below are the results we got on our builds. If you are not familiar with the tests, here is a quick overview of how to read the results: we write X/Y to signify that X exceptional cases were observed over Y runs. In our case, the exceptional case is a breakage of atomicity. The tests are statistical: you can only prove the implementation is not atomic; you cannot prove the implementation is always atomic.

Table 1. Atomicity on ARMv7

          -XX:-AlwaysAtomicAccesses    -XX:+AlwaysAtomicAccesses
          -client      -server         -client      -server
long      54K/1.75G    0/2.4G          0/1.7G       0/2.3G
double    0/1.6G       0/2.6G          0/1.6G       0/2.6G

You may notice several things there:

  • double tests appear atomic in every mode. This is because the floating-point accesses are routed through the floating-point unit, and the instructions there are 64 bits wide. Hence, no action is needed to preserve atomicity there; we get it for free.

  • long tests only fail with -client, but not with -server. This is because C2 already selects atomic instruction sequences for 64-bit value accesses, while C1 does not. Hence, the performance differences between these two may highlight potential performance issues.

ARMv7 can get the access atomicity by using extended instruction sets.[3]

x86 Ivy Bridge (32-bit)

There are enough cores on our stock x86 for concurrency testing to be sensible. Below are the results we got on our builds.

Table 2. Atomicity on x86 Ivy Bridge running 32-bit VM

          -XX:-AlwaysAtomicAccesses    -XX:+AlwaysAtomicAccesses
          -client      -server         -client      -server
long      45K/3.6G     42K/3.4G        0/32.9G      0/52.9G
double    0/3.3G       0/6.3G          0/3.3G       0/6.2G

You may notice several things there:

  • double tests appear atomic in every mode. This is because the floating-point accesses are routed through the FPU/SSE, and the instructions there are at least 64 bits wide. Hence, no action is needed to preserve atomicity there; we get it for free.

  • long tests only fail with -XX:-AlwaysAtomicAccesses; this is because the generated code does the access non-atomically, with two 32-bit moves:

  0xe7702b6f: mov    0x10(%esi),%eax
  0xe7702b72: mov    0x14(%esi),%edx
  0xe7702b75: mov    %eax,0x8(%esi)
  0xe7702b78: mov    %edx,0xc(%esi)
  • long tests regain atomicity with -XX:+AlwaysAtomicAccesses, at the cost of going for the SSE reads/writes:

  0xe76af353: vmovsd 0x10(%ebx),%xmm0       ; reader: read 64-bit atomically
  0xe76af358: vmovd  %xmm0,%ebp             ; reader: read lo word
  0xe76af35c: vpsrlq $0x20,%xmm0,%xmm0      ; reader: reshuffle
  0xe76af361: vmovd  %xmm0,%edi             ; reader: read hi word

  0xe76af368: vmovd  %ebp,%xmm0             ; writer: write lo word
  0xe76af36c: vmovd  %edi,%xmm1             ; writer: write hi word
  0xe76af370: vpunpckldq %xmm1,%xmm0,%xmm0  ; writer: pack (lo, hi)
  0xe76af374: vmovsd %xmm0,0x8(%ebx)        ; writer: write 64-bit atomically

x86 Ivy Bridge (64-bit)

In 64-bit mode, all basic types have native instructions to operate with. The correctness tests merely confirm it.

Table 3. Atomicity on x86 Ivy Bridge running 64-bit VM

          -XX:-AlwaysAtomicAccesses    -XX:+AlwaysAtomicAccesses
          -client      -server         -client      -server
long      0/5.9G       0/6.7G          0/6.3G       0/6.3G
double    0/6.7G       0/6.6G          0/6.6G       0/6.3G

x86 Atom (32-bit)

x86 Atom is an interesting platform performance-wise, because it is in-order and has less memory bandwidth. However, in the correctness sense, it is the same as Ivy Bridge. Note the small number of failure cases: this is because the competing threads are running on a single hyper-threaded core, and so the rendezvous over the racy load/store is very tricky.

Table 4. Atomicity on x86 Atom running 32-bit VM

          -XX:-AlwaysAtomicAccesses    -XX:+AlwaysAtomicAccesses
          -client      -server         -client      -server
long      6/75M        4/98M           0/76M        0/90M
double    0/69M        0/110M          0/61M        0/98M

PowerPC e500mc (32-bit)

PowerPC is another interesting platform, and it behaves similarly to what we saw before:

  • double tests appear atomic already, since the floating-point ISA is used to read/write 64 bits atomically

  • long tests fail with -XX:-AlwaysAtomicAccesses, but pass with -XX:+AlwaysAtomicAccesses

  • There is no -server build for PowerPC yet, but it will be available soon.

Table 5. Atomicity on PowerPC e500mc running 32-bit VM

          -XX:-AlwaysAtomicAccesses    -XX:+AlwaysAtomicAccesses
          -client      -server         -client      -server
long      391K/1.7G    N/A             0/1.5G       N/A
double    0/1.6G       N/A             0/1.6G       N/A

PowerPC gains the access atomicity by using special instructions.[3]

Benchmarks

Now that we have established trust in the functional side of things, we need benchmarks to answer the questions we are after. Measuring performance on the nanosecond scale is very tough, and we need to divert a bit from the usual methodology, because the infrastructural overheads eat through the payload.

This is the benchmark we are using. Notable things about the benchmark are:

  • It does not use any implicit or explicit Blackholes, because the overhead of doing additional reads and writes consumes the entire payload, which is just a single load/store. Instead, we piggyback on @State field stores not being elided, since the state object escapes. Since we are playing hard, we double-check in the assembly that the stores are actually there in the hot loops.

  • On the reader side, we measure the cost of reading the source field of type T, and writing it back into a sink of type T. This mixes the operations a bit when the source and sink fields are not of the same access type, but this is the lesser evil compared to the cost of other ways of dealing with dead-code elimination.

  • On the writer side, we measure the cost of storing a constant into the source field of type T. Since the constant writes do not depend on inputs, they may be commoned across benchmark calls. We have to check the disassembly to make sure the stores are there in the hot loops.
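Distilled into plain Java (the actual benchmark wraps these in JMH @Benchmark methods over a @State object; the class and field names below are illustrative), the measured payloads are just:

```java
// Plain-Java distillation of the measured payloads. The store into the
// sink field escapes through the object, so it cannot be elided.
class Payloads {
    long source = -1L;  // field read by the reader, written by the writer
    long sink;          // reader stores here; substitute other types for T

    void reader() { sink = source; }  // measured: one load + one store
    void writer() { source = 1L; }    // measured: one constant store
}
```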

ARMv6 (32-bit)

Without further ado, let’s dive into the ARMv6 results (raw data here):

  • There is no C2 compiler for ARMv6, only C1.

  • int and float tests provide consistent performance, and nothing is affected by the new feature.

  • double is not affected, since it is already atomic.

  • It is remarkable that volatile operations appear on the same scale as normal loads and stores. This is due to the compiler emitting no barriers, since the runtime detected it runs on a uniprocessor.

  • Forcing atomicity for long-s sometimes improves performance a bit, since we select better instructions to access the value.

arm v6.data

ARMv7 (32-bit)

ARMv7 is more modern, and more consistent (raw data here):

  • As expected, the changes do not touch any float or int results, since the operations are already atomic there.

  • double tests are also not affected, since the atomicity is de-facto there.

  • In C1, long tests improve because we switch to another ISA to access the values. Not only does this provide access atomicity, it also improves performance.

  • In C2, long tests degrade because we switch to another ISA to access the 64-bit values, even though it is not required for atomicity. This is a code generation quirk.

  • The difference between access-atomic reads/writes and volatile ones is very significant: up to 2x difference.

arm v7.data

x86 Ivy Bridge (64-bit)

Let’s start with 64-bit x86. There is NO performance difference whatsoever, since all accesses are already atomic (raw data here):

  • Note the huge difference between the already-atomic plain access and its volatile counterpart: a 10x more costly access, in order to have de jure access atomicity when you already have it de facto!

x86 ivy 64.data

x86 Ivy Bridge (32-bit)

32-bit x86 builds start to behave differently (raw data here):

  • As expected, the changes do not touch any float or int results, since the operations are already atomic there.

  • double tests are also not affected, since the atomicity is there de facto, because we go through the FPU code.

  • In C1, long tests degrade significantly, since we generate suboptimal machine code. This is additionally confirmed by the C2 results, which use more efficient codegen choices.

  • In C2, long tests are close to a tie, since we are going through the SSE code, which is much faster.

  • Note the difference between atomic plain accesses and volatile ones: it is as much as 5x.

x86 ivy 32.data

x86 Atom (32-bit)

Ivy Bridge provides an interesting baseline: that hardware can do multiple memory accesses at once, and hence can amortize the access costs. Atom has no such luxury, and so it makes an interesting platform to measure. It is doubly interesting because the generated code for this test is equivalent on both Ivy Bridge and Atom (raw data here):

  • double, float, int results are still not affected, which is good.

  • In C1, long results degrade more than 2x: the cost of the additional atomic long write dominates even that of the volatile store (although without the global effects on coherency). This is most probably due to inefficient code generation, which moves the SSE operands over GP registers, and on Atom this transfer hurts.

  • In C2, long results also degrade more than 2x: the additional atomic store takes a few extra cycles, again probably due to the SSE-GP transfer.

  • Here, the difference between access-atomic plain ops and their volatile variants is subtle, with volatile still being a bit slower.

x86 atom.data

PowerPC e500mc (32-bit)

Last, but not least, PowerPC (raw data here):

  • double, float, int results are still not affected, which is expected.

  • C2 build is not yet available, and so we only have C1 results.

  • The access-atomic long accesses are 2x slower than their non-atomic counterparts, regardless of whether it is read or write.

  • Note the huge difference between access-atomic plain and volatile accesses: the volatile stores are 4x slower than access-atomic ones!

ppc e500mc.data

Conclusions and Future Research

The hit from using volatile to enforce the spec-mandated atomic accesses is large, because we conflate the access atomicity semantics with memory synchronization issues.

On the other hand, access atomicity seems to already be enforced beyond the provisions of the spec in most implementations. In other cases, the fallback strategies incur bearable overheads, and in some cases even improve performance! In many cases, the performance issues are due to code generation deficiencies, which are arguably fixable. It is still an open question whether the provided instruction sequences are actually atomic.

Of course, the costs of enforcing atomicity for 64-bit types can be completely subsumed by other operations in the instruction stream (volatile accesses can’t be, at least not easily). We will have more wide-scale performance testing once the experimental VM feature (JDK-8033380) integrates into the mainline builds.


1. "server" mode is frequently used as an alias for the C2 compiler, but that is messy once TieredCompilation is involved: both C1 and C2 take part in tiered compilation in "server" mode.
2. Ditto, this is "client" mode compiler.
3. Don’t see the assembly code? That is because Java SE Embedded is not open source, and so publishing the generated code disassembly is arguably not fair use. You will have to trust me on what I saw, even though that makes the claim unverifiable. Oh, the humanity!