Aleksey Shipilёv, @shipilev, aleksey@shipilev.net
THIS IS A WORK IN PROGRESS: although it appears logically consistent at this point, it is still not final and lacks proper peer review. Use with caution.
Preface
Java Memory Model overhaul raises all sorts of questions. To answer them, we sometimes need performance experiments to guide us through the pros and cons of a particular proposal. Of course, other things matter as well: code reliability, learning effort, simplicity of reasoning, etc. usually prevail. This post continues exploring some of the early questions.
As has become a good tradition, we will take some diversions into benchmarking methodology, so that even though the post itself is targeted at platform people, non-platform people can still learn a few tricks. As usual, if you still haven’t learned about JMH and/or haven’t looked through the JMH samples, then I suggest you do that first before reading the rest of this post, for the best experience.
Access Atomicity
Specification requirements
There are different notions of atomicity in the multi-threaded world. The best-known notion of atomicity belongs to "operations", e.g. whether a group of operations is observable in only two states: executed completely, or not yet executed. There are a few other "flavors" of atomicity, and we will focus on a particular one: access atomicity.
Access to a variable is atomic if the result of that access appears indivisible. This seems like an intuitive thing to most programmers, but it may be hard to implement on every target architecture, so you sometimes need to make a trade-off. The notorious example of a trade-off like this is the exception from the atomicity of long and double variables in the Java Memory Model. That is, this program:
long x = 0L;

void thread1() {
  x = -1L;
}

void thread2() {
  println(x);
}
…is allowed to print "0", "-1", or any other transient value. JMM provides an escape hatch from this behavior: setting the volatile modifier on the field regains atomicity. Therefore, this program is required to print either "0" or "-1":
volatile long x = 0L;

void thread1() {
  x = -1L;
}

void thread2() {
  println(x);
}
Implementation Specifics
In the absence of native read/write operations of the required width, the implementation has to use other available means to fit the language semantics. This may include using extended instruction sets to get atomic accesses (e.g. x87 FPU, SSE, VFP, or other vector instructions), resorting to CAS-ed writes, or even locking.
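As a sketch of the CAS-ed write strategy: an indivisible 64-bit store can be emulated with a compare-and-swap loop. This is illustrative Java, using AtomicLong as a stand-in for the hardware CAS instruction, not what the VM actually emits:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasStore {
    // Emulate an indivisible 64-bit store on hardware that offers
    // a 64-bit compare-and-swap, but no indivisible 64-bit plain store.
    static void atomicStore(AtomicLong cell, long newValue) {
        long current;
        do {
            current = cell.get();        // read the current full value
        } while (!cell.compareAndSet(current, newValue)); // swap it in whole
    }

    public static void main(String[] args) {
        AtomicLong cell = new AtomicLong(0L);
        atomicStore(cell, -1L);
        System.out.println(cell.get()); // prints -1
    }
}
```

Note the cost implication: every store now carries at least one atomic read-modify-write, which is why such substitutions are worth measuring.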
Since these substitutions are not required to provide the memory ordering guarantees, it seems convenient for the implementation to decouple emitting the access instruction sequences from the memory semantics (e.g. emitting the barriers). This makes volatile even more conflated: users can’t request access atomicity without getting the memory barriers as an add-on; and conversely, can’t eliminate the memory barriers around a volatile access without giving up access atomicity.
Hence, we get a qualitative performance model: a normal access should be faster than an access-atomic access, which in turn should be faster than a volatile access. Quantitative experiments should therefore answer two questions:
What is the cost difference between normal and access-atomic accesses? If this difference is small, then it seems prudent to enforce access atomicity for all types.
What is the cost difference between access-atomic and volatile accesses? If this difference is large, then it reinforces the idea of enforcing access atomicity.
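The cost questions above can be probed with a JMH benchmark along these lines. This is a simplified sketch (class and benchmark names are mine, and the actual benchmarks used for the experiments may differ) that contrasts plain and volatile long accesses; the plain accesses become access-atomic under the experimental VM flag introduced below:

```java
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class AccessAtomicityBench {
    long plain;        // plain access; access-atomic only with the experimental flag
    volatile long vol; // volatile access: atomicity plus memory barriers

    @Benchmark
    public long readPlain() {
        return plain;
    }

    @Benchmark
    public long readVolatile() {
        return vol;
    }

    @Benchmark
    public void writePlain() {
        plain = 42L;
    }

    @Benchmark
    public void writeVolatile() {
        vol = 42L;
    }
}
```

Running the same benchmark with and without the atomicity-enforcing flag would then separate the cost of atomicity from the cost of the barriers.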
Experimental Setup
Platforms
Since we are dealing with hardware support issues, we need a wide range of architectures and microarchitectures. This pushes us to run the tests on at least five interesting platforms:
x86 (Ivy Bridge): 2 socket, 12 core, 2 hyperthreaded, Xeon E5-2697v2, Ivy Bridge, 2.7 GHz
x86 (Atom): 1 socket, 1 core, 2 hyperthreaded, Atom Z530, 1.6 Ghz
ARMv6: 1 socket, 1 core, Broadcom BCM2835 SoC (Raspberry Pi), 0.7 GHz
ARMv7: 1 socket, 4 core, Exynos 4412 Prime (Odroid-U2), Cortex-A9, 1.7 GHz
POWERv6: 1 socket, 8 core, Freescale P4080, e500mc, 1.5 GHz
We are going to use the JDK 8 source tree as the baseline for the experiments. This includes the Java SE 8 Embedded ARM/PPC ports, which are not available in OpenJDK but can be downloaded from the Oracle website. Lucky for us, all the required changes are doable in the machine-independent parts available in OpenJDK.
Experimental VM changes
Luckily for us, the mechanism to provide double/long atomicity is already available in the current HotSpot VM, and that mechanism is already decoupled from the rest of the memory model machinery, i.e. the barriers. This simple experimental VM change was enough to gain the atomicity unconditionally in both the C2[1] and C1[2] HotSpot compilers (JDK-8033380):
--- old/src/share/vm/c1/c1_LIRGenerator.cpp 2014-02-11 21:29:45.730836748 +0400
+++ new/src/share/vm/c1/c1_LIRGenerator.cpp 2014-02-11 21:29:45.566836744 +0400
@@ -1734,7 +1734,8 @@
(info ? new CodeEmitInfo(info) : NULL));
}
- if (is_volatile && !needs_patching) {
+ bool needs_atomic_access = is_volatile || AlwaysAtomicAccesses;
+ if (needs_atomic_access && !needs_patching) {
volatile_field_store(value.result(), address, info);
} else {
LIR_PatchCode patch_code = needs_patching ? lir_patch_normal : lir_patch_none;
@@ -1807,7 +1808,8 @@
address = generate_address(object.result(), x->offset(), field_type);
}
- if (is_volatile && !needs_patching) {
+ bool needs_atomic_access = is_volatile || AlwaysAtomicAccesses;
+ if (needs_atomic_access && !needs_patching) {
volatile_field_load(address, reg, info);
} else {
LIR_PatchCode patch_code = needs_patching ? lir_patch_normal : lir_patch_none;
--- old/src/share/vm/c1/c1_Runtime1.cpp 2014-02-11 21:29:46.342836763 +0400
+++ new/src/share/vm/c1/c1_Runtime1.cpp 2014-02-11 21:29:46.178836759 +0400
@@ -809,11 +809,10 @@
int bci = vfst.bci();
Bytecodes::Code code = caller_method()->java_code_at(bci);
-#ifndef PRODUCT
// this is used by assertions in the access_field_patching_id
BasicType patch_field_type = T_ILLEGAL;
-#endif // PRODUCT
bool deoptimize_for_volatile = false;
+ bool deoptimize_for_atomic = false;
int patch_field_offset = -1;
KlassHandle init_klass(THREAD, NULL); // klass needed by load_klass_patching code
KlassHandle load_klass(THREAD, NULL); // klass needed by load_klass_patching code
@@ -839,11 +838,17 @@
// is the path for patching field offsets. load_klass is only
// used for patching references to oops which don't need special
// handling in the volatile case.
+
deoptimize_for_volatile = result.access_flags().is_volatile();
-#ifndef PRODUCT
+ // If we are patching a field which should be atomic, then
+ // the generated code is not correct either, force deoptimizing.
+ // We need to only cover T_LONG and T_DOUBLE fields, as we can
+ // break access atomicity only for them.
+
patch_field_type = result.field_type();
-#endif
+ deoptimize_for_atomic = (AlwaysAtomicAccesses && (patch_field_type == T_DOUBLE || patch_field_type == T_LONG));
+
} else if (load_klass_or_mirror_patch_id) {
Klass* k = NULL;
switch (code) {
@@ -918,13 +923,19 @@
ShouldNotReachHere();
}
- if (deoptimize_for_volatile) {
- // At compile time we assumed the field wasn't volatile but after
- // loading it turns out it was volatile so we have to throw the
+ if (deoptimize_for_volatile || deoptimize_for_atomic) {
+ // At compile time we assumed the field wasn't volatile/atomic but after
+ // loading it turns out it was volatile/atomic so we have to throw the
// compiled code out and let it be regenerated.
if (TracePatching) {
- tty->print_cr("Deoptimizing for patching volatile field reference");
+ if (deoptimize_for_volatile) {
+ tty->print_cr("Deoptimizing for patching volatile field reference");
+ }
+ if (deoptimize_for_atomic) {
+ tty->print_cr("Deoptimizing for patching atomic field reference");
+ }
}
+
// It's possible the nmethod was invalidated in the last
// safepoint, but if it's still alive then make it not_entrant.
nmethod* nm = CodeCache::find_nmethod(caller_frame.pc());
--- old/src/share/vm/opto/parse3.cpp 2014-02-11 21:29:46.890836776 +0400
+++ new/src/share/vm/opto/parse3.cpp 2014-02-11 21:29:46.734836772 +0400
@@ -233,7 +233,8 @@
// Build the load.
//
MemNode::MemOrd mo = is_vol ? MemNode::acquire : MemNode::unordered;
- Node* ld = make_load(NULL, adr, type, bt, adr_type, mo, is_vol);
+ bool needs_atomic_access = is_vol || AlwaysAtomicAccesses;
+ Node* ld = make_load(NULL, adr, type, bt, adr_type, mo, needs_atomic_access);
// Adjust Java stack
if (type2size[bt] == 1)
@@ -314,7 +315,8 @@
}
store = store_oop_to_object(control(), obj, adr, adr_type, val, field_type, bt, mo);
} else {
- store = store_to_memory(control(), adr, val, bt, adr_type, mo, is_vol);
+ bool needs_atomic_access = is_vol || AlwaysAtomicAccesses;
+ store = store_to_memory(control(), adr, val, bt, adr_type, mo, needs_atomic_access);
}
// If reference is volatile, prevent following volatiles ops from
--- old/src/share/vm/runtime/globals.hpp 2014-02-11 21:29:47.466836790 +0400
+++ new/src/share/vm/runtime/globals.hpp 2014-02-11 21:29:47.282836785 +0400
@@ -3859,6 +3859,9 @@
"Allocation less than this value will be allocated " \
"using malloc. Larger allocations will use mmap.") \
\
+ experimental(bool, AlwaysAtomicAccesses, false, \
+ "Accesses to all variables should always be atomic") \
+ \
product(bool, EnableTracing, false, \
"Enable event-based tracing") \
\
However, due to the specifics of instruction selection in both compilers, they may generate different code sequences even for plain reads and writes. We will quantify this more rigorously later. So far, this VM change allows us to enforce atomicity for all types with -XX:+UnlockExperimentalVMOptions -XX:+AlwaysAtomicAccesses.
Correctness Results
We use jcstress to validate that our changes indeed regain the atomicity. By the nature of functional testing, you can’t confirm the atomicity is regained, but you can estimate the probability of breaking it. jcstress has bundled tests for our line of testing:
$ java -jar jcstress.jar -t ".*atomicity.primitives.plain.(Long|Double).*" -jvmArgs "-server -XX:+UnlockExperimentalVMOptions -XX:+AlwaysAtomicAccesses" -v -time 30000 -iters 20
To make the runs quicker, we only target the long/double tests, and only their non-volatile variants.
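For reference, the bundled atomicity tests look roughly like this in the jcstress API. The annotation and result class names below are from jcstress itself, but this sketch is not a verbatim copy of the bundled test:

```java
import org.openjdk.jcstress.annotations.*;
import org.openjdk.jcstress.infra.results.J_Result;

import static org.openjdk.jcstress.annotations.Expect.*;

@JCStressTest
@Outcome(id = "0",  expect = ACCEPTABLE, desc = "Reading the default value.")
@Outcome(id = "-1", expect = ACCEPTABLE, desc = "Reading the written value.")
@Outcome(           expect = FORBIDDEN,  desc = "Torn read: atomicity is broken.")
@State
public class PlainLongAtomicity {
    long x; // deliberately non-volatile: atomicity hinges on the VM flag

    @Actor
    public void writer() {
        x = -1L;
    }

    @Actor
    public void reader(J_Result r) {
        r.r1 = x; // any value other than 0 or -1 is a torn read
    }
}
```

The harness runs both actors concurrently many times and tallies the observed outcomes, which is where the X/Y counts below come from.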
ARMv6 (32-bit)
Since this platform is uniprocessor, it seems useless to run the correctness tests there.
ARMv7 (32-bit)
There are four cores on our stock ARMv7, so concurrency testing becomes sensible. Below are the results we got on our builds. If you are not familiar with the tests, here’s a quick overview of how to read the results: we write X/Y to signify that X exceptional cases were observed over Y runs. In our case, the exceptional case is a breakage of atomicity. The tests are statistical, so you can only prove the implementation is not atomic; you cannot prove the implementation is always atomic.