Aleksey Shipilёv, @shipilev, aleksey@shipilev.net

Note: This post is also available in ePUB and mobi.

THIS IS A WORK IN PROGRESS: although it appears logically consistent at this point, it is still not final and lacks proper peer review. Use with caution.

Preface

The Java Memory Model overhaul raises all sorts of questions. To answer them, we sometimes need performance experiments to guide us through the pros and cons of a particular proposal. Of course, other things matter as well: code reliability, learning effort, simplicity of reasoning, etc. usually prevail.

One of the questions that came up early is: what is the cost of extending the final field memory model guarantees to all fields initialized in constructors? This post tries to approach that question.

As is good tradition, we will take some diversions into benchmarking methodology, so that even though the post itself is targeted at platform people, non-platform people can still learn a few tricks. As usual, if you have not yet learned about JMH and/or have not looked through the JMH samples, I suggest you do that before reading the rest of this post, for the best experience.

Final Fields Semantics

Specification

Final field semantics in the JLS is quite sophisticated. It is one of the pieces which comprise the Java Memory Model (the others being: read/write atomicity of values, happens-before semantics, and well-formed executions with the commit order). Final field semantics boils down to this:

  1. Def: There are memory/dereference chains which the program uses to access the final field. The distinction between memory and dereference chains is not essential at the basic level. Think of these as the indirection chains which lead us to the field access.

  2. Def: There is a synthetic freeze action at the end of the constructor.

  3. Rule: Accessing a field initialized in the constructor via dereference/memory chains containing the freeze action yields the correctly initialized value.

  4. Corollary from (3): Having no way to access the final field other than through chains containing the freeze action guarantees the field is always observed to be properly constructed, regardless of who is observing it. (Failing to meet this requirement, e.g. by leaking this from the constructor, opens up chains bypassing the freeze action; see the sketch below.)
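
To illustrate that corollary, here is a minimal sketch (the class and field names are made up) of how leaking this defeats the guarantee:

  class Leaky {
    static Leaky leaked;
    final int f;

    Leaky() {
      leaked = this; // leaks the reference BEFORE the freeze action
      f = 42;
    }                // <-- the freeze action fires here, too late for "leaked"
  }

A thread reading Leaky.leaked.f reaches the field through a chain that bypasses the freeze action, and may legally observe 0 instead of 42.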

This semantics offers great freedom for the runtime to cache final fields as soon as it discovers them. The added "benefit" is that final-bearing objects are safely publishable through data races. Racy publication is still a big NO-NO, but for final fields it is more tolerable.
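
To make this concrete, here is a minimal sketch of racy-but-safe publication through a final field (the names are hypothetical):

  class Holder {
    final int f;
    Holder() { f = 42; }   // freeze action at the end of the constructor
  }

  class Publisher {
    static Holder holder;  // deliberately non-volatile: the publication races

    static void writer() {
      holder = new Holder();
    }

    static void reader() {
      Holder h = holder;   // racy read: may observe null...
      if (h != null) {
        int r = h.f;       // ...but if non-null, r is guaranteed to be 42,
                           // because the only chain to f passes the freeze
      }
    }
  }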

Implementation Support

Since memory models are a great compromise between what users can expect and what we are actually able to pull off, any semantics needs a plausible implementation. One of the conservative implementations is described in Doug’s JSR 133 Cookbook.[1] Note the Cookbook is not the definition of the Java Memory Model; it just describes a simple set of rules an implementation may follow to match the requirements of the JMM.

In short, on the writer side it is enough to make sure the stores in the constructor are ordered before the publication of the object holding the field. That requires us to put a barrier between the initializing stores and the publishing of the reference:

  C tc = <new>  // initialize object + metadata
  tc.f = 42;    // initialize final field
  [LoadStore|StoreStore]
  c = tc;       // publish

The JSR 133 Cookbook only requires StoreStore, but we might also require the LoadStore barrier. It covers a corner case when the final field value depends on some field which experiences a racy update (the interesting example is incrementing the field itself). This corner case may result in storing a value derived from that racy update, not from the constructor itself, and that would obliterate the safe construction guarantees[2].
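
A minimal sketch of the kind of code that corner case worries about (the names are made up): the constructor derives the final value from a location that other threads update racily.

  class Node {
    static int racy;  // updated by other threads without synchronization

    final int f;

    Node() {
      // The value of f is derived from a racy read. With only StoreStore,
      // that read is not ordered against the publication; LoadStore closes
      // the gap, so readers cannot observe an f derived from a racy update
      // made after publication (see [2] for the details).
      f = racy + 1;
    }
  }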

Now, these barriers are intents, not the real things. Think of them as synthetic actions which prevent the reordering of relevant operations across the barrier. This action is processed on two levels: runtimes are prohibited from moving ops around that barrier, and are also required to communicate the barrier to hardware, if needed. We will see that some hardware requires the explicit barrier, and some hardware does not.

There is a symmetric requirement on the reader side: we enforce the order between the reference access and the field access itself:

  C tc = c;
  [LoadLoad]
  r1 = tc.f;

Breaking this order is not directly possible on the runtime side, because it would violate the data dependencies. Note, however, that if we somehow acquired c early, then we can speculate about its value and do the field access early, breaking the visibility of the written value of tc.f. The spec specifically allows this behavior, because it means we have an indirection chain to the field access bypassing the freeze action.

It turns out most hardware also respects the order of so-called 'dependent' reads, and hence does not require emitting a barrier there.

HotSpot Code Example

Let’s see how the current implementation deals with these two constraints. Since all platforms we target already respect dependent loads, we only concentrate on the writer side. This is the relevant part of HotSpot code which emits the aforementioned barrier if we detect we have written out a final field:

 900 void Parse::do_exits() {
   ...
 910   if (wrote_final()) {
 911     // This method (which must be a constructor by the rules of Java)
 912     // wrote a final.  The effects of all initializations must be
 913     // committed to memory before any code after the constructor
 914     // publishes the reference to the newly constructed object.
 915     // Rather than wait for the publication, we simply block the
 916     // writes here.  Rather than put a barrier on only those writes
 917     // which are required to complete, we force all writes to complete.
 918     //
 919     // "All bets are off" unless the first publication occurs after a
 920     // normal return from the constructor.  We do not attempt to detect
 921     // such unusual early publications.  But no barrier is needed on
 922     // exceptional returns, since they cannot publish normally.
 923     //
 924     _exits.insert_mem_bar(Op_MemBarRelease, alloc_with_final());
 925 #ifndef PRODUCT
 926     if (PrintOpto && (Verbose || WizardMode)) {
 927       method()->print_name();
 928       tty->print_cr(" writes finals and needs a memory barrier");
 929     }
 930 #endif
 931   }
   ...

This code is part of the machinery parsing the incoming bytecode into the high-level internal representation (HIR, or IR). This membar materializes (or not) in the machine code when the emitter produces it. It may be confusing that the code appears to add MemBarRelease, but fear not: it is actually LoadStore|StoreStore, which makes it a release fence: no prior operation can float below the succeeding store.

Now, what do these barriers look like on different platforms? In all cases, the runtime still needs to preserve the order in the generated machine code, so there is a barrier effect even if the hardware does not require anything else. As far as hardware is concerned, the JSR 133 Cookbook sums up the hardware specs:

  • x86 already maintains Total Store Order, and needs no barriers

  • ARMv7 can freely reorder writes, hence we need dmb sy

  • PowerPC can freely reorder writes, hence we need lwsync

These differences at the hardware level complicate the performance model around finals, and make our investigation job harder.

Now that we have made the required introductions, it is interesting to estimate the actual cost of maintaining the final field semantics. Since this semantics is arguably usable for safe initialization even for mutable non-final objects, it poses the question: "What would be the cost of requiring final-like semantics for all objects in an application?"

Experimental Setup

As we figured out in the previous section, our performance model predicts different performance depending on the hardware. This pushes us to run the tests on at least three major platforms:

  • x86: 2 sockets × 12 cores × 2 hyperthreads; Xeon E5-2697v2 (Ivy Bridge), 2.7 GHz

  • ARM: 1 socket × 4 cores; Exynos 4412 Prime (Cortex-A9), 1.7 GHz

  • PowerPC: 1 socket × 8 cores; Freescale P4080 (e500mc), 1.5 GHz

We are going to use the JDK 8 source tree as the baseline for the experiments. This includes the Java SE 8 Embedded ARM/PPC ports, which are not available in OpenJDK but can be downloaded from the Oracle website. Lucky for us, all the required changes can be made in the machine-independent parts available in OpenJDK.

Part I. Allocation

Rationale

It all sounds simple so far, right? Especially knowing where exactly the barriers for final fields are in HotSpot. Hack it up, run some benchmarks, and conclude. This is the wrong way to start: you first need to understand what you are dealing with. In that spirit, let’s construct a simple benchmark first. This benchmark has some interesting traits:

  • Since we are about to figure out a performance effect on the writers' side, we implicitly need to measure object allocation. Measuring allocation without getting screwed by the optimizing compiler requires us to escape new objects, in order to convince the runtime the object cannot be subject to, say, scalar replacement. Lucky for us, JMH already takes care of that, and we only need to return the instance back.

  • Now, when initializing the fields, how many fields should we initialize for a reliable benchmark? The answer is: if you don’t know, you should try many! We are trying 0..64 fields with exponential steps. We chose 64 because why not.

  • It is important to provide a basic level of control. That is, some things are expected not to change (these are called negative controls), and some things are expected to change (these are called positive controls). We have the F* classes bearing all-final fields as the negative control: if we emit the barriers everywhere, their performance should not change. We have the P* classes bearing all-plain fields as the positive control: if we emit the barriers everywhere, their performance should change to match the performance of the F* classes exactly. (A sketch of what these benchmarks could look like follows this list.)

  • Note that the difference between the P* and F* tests should already provide an insight into final field barrier costs, but we will table that thought for a while.
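
For illustration, here is a minimal JMH sketch of such a benchmark (class and method names are made up; the real benchmark spans 0..64 fields):

  import org.openjdk.jmh.annotations.Benchmark;

  public class AllocBench {

    // Negative control: all-final fields, the barrier is emitted already.
    static class F4 {
      final int f1, f2, f3, f4;
      F4() { f1 = 42; f2 = 42; f3 = 42; f4 = 42; }
    }

    // Positive control: all-plain fields, the barrier appears only under
    // the experimental flag.
    static class P4 {
      int f1, f2, f3, f4;
      P4() { f1 = 42; f2 = 42; f3 = 42; f4 = 42; }
    }

    @Benchmark
    public F4 final_4() {
      return new F4(); // returning escapes the object; JMH consumes it
    }

    @Benchmark
    public P4 plain_4() {
      return new P4();
    }
  }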

Gaining the Confidence

This is the part about benchmarking quirks and the importance of experimental control. You can skip this part and go right to the ARM section below.

Benchmark Infrastructure

When you first run the benchmark on Cortex-A9 vs. x86, you will first notice the performance of these two differs by orders of magnitude. That is not surprising, since the platforms are really different: a production-grade server and a tiny embedded device. But what’s really frightening is that Cortex-A9 timings are so high that changing the number of written fields drowns in the overhead. This is where it gets interesting: before you proceed, you need to tune up the measurement environment.

We could probably shove in some performance data to demonstrate, but really, this post would get overloaded soon! The goal for this section is to implant the idea that you should have established trust in the benchmarking infrastructure. "Established" means you have to be skeptical about the infra until you prove it is correct. If anything surprising surfaces in the experiments, it is safe to first blame the infra.

It takes just a few sentences to declare, but in reality the infra improvements took a full day's worth of running the benchmarks, staring at disassembly, coming up with an idea, testing it, testing it does not regress anything else, running the target benchmark, and going back to step 1. Quite a few things had to be tuned up in JMH before we could proceed with nanobenchmarking small ARMs.

To repeat: you don’t trust the infra blindly; you analyze it along with your experiment, fix it up, and only then proceed. Doing otherwise will waste your time. This is one of the reasons to use a benchmark harness: it saves time when somebody else has taken care of some of the problems for you.

Validating the Performance Model, Try 1

Now that we have allegedly established some trust in the benchmark, we can proceed to patching the VM and running the benchmark there. The code above only takes care of the C2[3] parts, and we also need the C1[4] parts. Hence, we come up with this patch:

diff -r c89630a122b4 src/share/vm/c1/c1_GraphBuilder.cpp
--- a/src/share/vm/c1/c1_GraphBuilder.cpp
+++ b/src/share/vm/c1/c1_GraphBuilder.cpp
@@ -1459,7 +1459,7 @@
       monitorexit(state()->lock_at(0), SynchronizationEntryBCI);
     }

-    if (need_mem_bar) {
+    if (need_mem_bar || AlwaysSafeConstructors) {
       append(new MemBar(lir_membar_storestore));
     }

@@ -1510,7 +1510,7 @@
     append_split(new MonitorExit(receiver, state()->unlock()));
   }

-  if (need_mem_bar) {
+  if (need_mem_bar || AlwaysSafeConstructors) {
       append(new MemBar(lir_membar_storestore));
   }

diff -r c89630a122b4 src/share/vm/opto/parse1.cpp
--- a/src/share/vm/opto/parse1.cpp
+++ b/src/share/vm/opto/parse1.cpp
@@ -907,7 +907,7 @@
   Node* iophi = _exits.i_o();
   _exits.set_i_o(gvn().transform(iophi));

-  if (wrote_final()) {
+  if (wrote_final() || AlwaysSafeConstructors) {
     // This method (which must be a constructor by the rules of Java)
     // wrote a final.  The effects of all initializations must be
     // committed to memory before any code after the constructor
diff -r c89630a122b4 src/share/vm/runtime/globals.hpp
--- a/src/share/vm/runtime/globals.hpp  Fri Jan 10 08:31:47 2014 -0800
+++ b/src/share/vm/runtime/globals.hpp  Tue Jan 14 22:28:13 2014 +0400
@@ -519,6 +519,9 @@
   develop(bool, CleanChunkPoolAsync, falseInEmbedded,                       \
           "Clean the chunk pool asynchronously")                            \
                                                                             \
+  experimental(bool, AlwaysSafeConstructors, false,                         \
+          "Force safe construction, as if all fields are final.")           \
+                                                                            \
   /* Temporary: See 6948537 */                                              \
   experimental(bool, UseMemSetInBOT, true,                                  \
           "(Unstable) uses memset in BOT updates in GC code")               \

Cross your fingers, and run it on some interesting platform, like ARM. I usually look straight at the raw numbers, but for most people, the flesh is weak. Here is the chart, then (raw data here):

[Chart: ARM, phase 1, allocation data]
  • The difference between final/minus and plain/minus is minuscule, 10 ns at worst, which already gives us an interesting data point.

  • Positive control PASSED: the final/plus and plain/plus performances match. This is good, because it means the flag affects the workload performance. Don’t laugh at this: it may indicate you actually built your changes in, you actually ran the bits you built, and you haven’t overlooked some simple stupid stuff.

  • Negative control FAILED: we turned the flag on, and experienced a performance degradation: see the difference between final/minus and final/plus. This is bad, because the flag affects performance in a way we did not intend.

This means either the patch or the benchmark is flawed. After some digging and going back and forth in the assembly, we realize the critical mistake. It is actually evident when comparing the generated code with the flag on/off: there are some new barriers around. Lots of them. Hm.

Aha! Parse::do_exits() is called for every single method. Short-circuiting the condition there unconditionally emits the barrier at the end of every method, not just constructors, and the code takes the additional performance hit. Cover your bruises and move on…​

Validating the Performance Model, Try 2

So, here’s how we do it properly!

--- old/hotspot/src/share/vm/c1/c1_GraphBuilder.cpp
+++ new/hotspot/src/share/vm/c1/c1_GraphBuilder.cpp
@@ -1436,7 +1436,7 @@

   bool need_mem_bar = false;
   if (method()->name() == ciSymbol::object_initializer_name() &&
-      scope()->wrote_final()) {
+      (scope()->wrote_final() || AlwaysSafeConstructors)) {
     need_mem_bar = true;
   }

--- old/hotspot/src/share/vm/opto/parse1.cpp	2014-01-16 01:25:14.282000203 +0400
+++ new/hotspot/src/share/vm/opto/parse1.cpp	2014-01-16 01:25:14.170000209 +0400
@@ -907,7 +908,9 @@
   Node* iophi = _exits.i_o();
   _exits.set_i_o(gvn().transform(iophi));

-  if (wrote_final()) {
+  if (wrote_final() ||
+        (AlwaysSafeConstructors &&
+           method()->name() == ciSymbol::object_initializer_name())) {
     // This method (which must be a constructor by the rules of Java)
     // wrote a final.  The effects of all initializations must be
     // committed to memory before any code after the constructor

Right? We need to double-check that we are actually parsing a constructor. OK then, let’s try to remeasure (raw data here):

[Chart: ARM, phase 2, allocation data]

It did help a bit, but not completely, hm. Let us stare at the assembly again. OK, most of the excess barriers are gone. But there is also an additional barrier right before we store the initial values for the fields. What’s that? Haven’t we requested a single barrier? Why is there an additional one?

Aha! That must be the Object::<init> constructor. The current code implicitly checks if the constructor had actually written out any final field, and this implicitly bypasses emitting the barrier for empty constructors! This test actually highlights that HotSpot is not coalescing the barriers, and we will explore that in the next section.
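
In other words (a hypothetical sketch, the class name is made up): every constructor implicitly chains to its superclass constructor, and the naive patch emits a barrier at the end of each of them:

  class P1 {
    int f1;
    P1() {    // implicitly calls super(), i.e. Object.<init>()
              // naive patch: a barrier at the end of Object.<init>(),
              // even though it wrote no fields
      f1 = 42;
    }         // plus the barrier we actually wanted, at the end of P1()
  }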

Meanwhile, we work around this issue, and move on…​

Validating the Performance Model, Try 3

Trying again, only emitting the barriers when the constructor has indeed written some fields:

--- old/hotspot/src/share/vm/c1/c1_GraphBuilder.cpp
+++ new/hotspot/src/share/vm/c1/c1_GraphBuilder.cpp
@@ -1436,7 +1436,7 @@

   bool need_mem_bar = false;
   if (method()->name() == ciSymbol::object_initializer_name() &&
-      scope()->wrote_final()) {
+      (scope()->wrote_final() || (AlwaysSafeConstructors && scope()->wrote_fields()))) {
     need_mem_bar = true;
   }

--- old/hotspot/src/share/vm/opto/parse1.cpp
+++ new/hotspot/src/share/vm/opto/parse1.cpp
@@ -907,7 +908,9 @@
   Node* iophi = _exits.i_o();
   _exits.set_i_o(gvn().transform(iophi));

-  if (wrote_final()) {
+  if (wrote_final() ||
+        (AlwaysSafeConstructors && wrote_fields() &&
+           method()->name() == ciSymbol::object_initializer_name())) {
     // This method (which must be a constructor by the rules of Java)
     // wrote a final.  The effects of all initializations must be
     // committed to memory before any code after the constructor

Now, are we happy? The performance data suggests we are: the control matches perfectly (see the chart below in the ARM section). Let us declare success and submit the patch to the OpenJDK JIRA: JDK-8031818. Note that we only needed to dive into the generated code details to get the idea of what could be wrong. The control itself is agnostic of those details. Control saved us from trusting a broken patch/benchmark, and did that just by validating the simple expected performance model.

Now that we have confirmed both the benchmark and the patch are plausible, we can (finally!) move on to measuring it longer on all interesting platforms. I find it both beautiful and disturbing that most people on the Internet hack together something as crude as we had at Phase 0, waste lots of time running it, then catch Stockholm Syndrome and start to defend that their approach is actually plausible. IT IS NOT. Accept that you wasted your time, and move on.

ARM

This is the final chart we have got for our ARM host (raw data):

[Chart: ARM, phase 3, allocation data]

Some quick observations:

  • Both positive and negative controls are good.

  • Allocating an object is hard work already: 30 ns is spent even on the empty class. There is still a barrier on the allocation path, which seems to guard the object metadata writes; see the disassembly below.

  • The overhead of placing final on the fields is rather small, within 5 ns, and only pronounced when we write just a few fields. After that, garbage collection costs (remember, we are allocating millions of objects per second) dominate the execution. The same chart, zoomed in on the Y axis, shows it better:

[Chart: ARM, phase 3, allocation data, zoomed Y axis]

To double-check the results, let’s see the disassembled code for plain/minus in -server mode (we cleaned up the unnecessary bloat, kept only the instance allocation path, and added a few comments).[5]

  ; store mark word
  ; store klass word
  ; put field f1
  ; put field f2
  ; put field f3
  ; put field f4
  ; StoreStore

There is a trailing StoreStore, and it looks like the final fields barrier. It is not: this is the object metadata barrier, which guards the values written out in the object header. It floated all the way down, to just before the object is published. This is a legal transformation, even though it appears as if the stores floated above the barrier: the memory effects in the IR are projected from the AllocateNode to the first use, which allowed us to move the barrier down to the escape point.

The same example, but for the final fields, final/minus and -server: [5]

  ; store mark word
  ; store klass word
  ; put field f1
  ; put field f2
  ; put field f3
  ; put field f4
  ; StoreStore
  ; LoadStore|StoreStore

The same assembly is produced for final/plus and plain/plus, hooray! The performance difference is due to the excess LoadStore|StoreStore; this is our barrier emitted from MemBarRelease. The important and interesting thing to see here is that both barriers go back-to-back, which opens up a simple way of peephole-ing both into a single one: (StoreStore + LoadStore|StoreStore) ⇒ LoadStore|StoreStore. This means the object header barrier can piggyback on the final release barrier. We recorded this observation as JDK-8032481.

Things get interesting in -client mode. With plain/minus, we get: [5]

  ; store mark word
  ; store class word
  ; cleaning object fields
  ; StoreStore
  ; put field f4
  ; put field f3
  ; put field f2
  ; put field f1

The metadata StoreStore barrier is right after the object was blankly initialized. The C1 IR is not sophisticated enough to track the memory projections, and hence the barrier sits right where we put it. For final/minus in -client mode, you might notice the slight difference:[5]

  ; store mark word
  ; store class word
  ; cleaning object fields
  ; StoreStore
  ; put field f4
  ; put field f3
  ; put field f2
  ; put field f1
  ; StoreStore

There is a trailing StoreStore barrier, like the one we emitted in our patch. This is the "final fields" barrier, and the assembly is the same in final/plus and plain/plus modes. You might notice -client only emits StoreStore, though, while C2 emits the full LoadStore|StoreStore. This is a gray area in the JMM at this point, and both behaviors are arguably legal.

x86

This is the final chart we got for x86 (raw data):

[Chart: x86, phase 3, allocation data]

Observations:

  • Both positive and negative controls are good.

  • Much faster than ARM, and that only makes benchmarking harder.

  • There is no statistically significant performance change in any of the modes.

  • The same chart, zoomed in on the Y axis, does not show anything interesting either:

[Chart: x86, phase 3, allocation data, zoomed Y axis]

Just to be on the safe side, let us peek into the assembly. The generated code is identical in all modes:

  0x00007fec3d075070: mov    %r10,0x60(%r15)
  0x00007fec3d075074: prefetchnta 0xc0(%r10)
  0x00007fec3d07507c: mov    $0xe04493a8,%r11d
  0x00007fec3d075082: mov    0xb0(%r12,%r11,8),%r10
  0x00007fec3d07508a: mov    %r10,(%rax)           ; store mark word
  0x00007fec3d07508d: movl   $0xe04493a8,0x8(%rax) ; store class word
  0x00007fec3d075094: movl   $0x2a,0xc(%rax)       ; put field f1
  0x00007fec3d07509b: mov    $0x2a0000002a,%r10    ; coalesced put fields f2, f3
  0x00007fec3d0750a5: mov    %r10,0x10(%rax)       ;
  0x00007fec3d0750b4: movq   $0x2a,0x18(%rax)      ; put field f4

No barriers, no fuss.

PowerPC

This is the final chart we got for PowerPC (raw data):

[Chart: PowerPC, phase 3, allocation data]
  • Unfortunately, we don’t have a C2 port for PowerPC yet, so only -client data is available.

  • Both positive and negative controls are arguably good. There is somewhat elevated noise compared to the other machines.

  • The same chart, zoomed in on the Y axis, suggests the difference between final/minus and plain/minus is within 10 ns:

[Chart: PowerPC, phase 3, allocation data, zoomed Y axis]

Confirming this with disassembly, plain/minus first: [5]

  ; construct class word ...
  ; construct mark word ...
  ; store mark word
  ; store class word
  ; StoreStore

It is consistent with the ARM C1 disassembly: we have the object metadata header guarded by the barrier, and then the naked field stores. The same thing, but for final/minus (plain/plus and final/plus yield the same code): [5]

  ; construct class word ...
  ; construct mark word ...
  ; store mark word
  ; store class word
  ; StoreStore
  ; put field f4
  ; put field f3
  ; put field f2
  ; put field f1
  ; LoadStore|StoreStore

Here, we have an additional trailing barrier after the field stores. This is our MemBarRelease lowered to lwsync. Note that we could also coalesce these barriers, but that requires more advanced coalescing logic in HotSpot.

Part II. Chained Initialization

Rationale

Now that we know the barriers are properly emitted, and we also managed to get the basic performance estimates, we can explore the quality-of-implementation issues a little bit closer. Since we already know the barriers are emitted at the end of the constructor in the bytecode parser, it seems educational to stress that part. We can do that by ramping up another targeted benchmark.

The highlights of that benchmark are:

  • We need to initialize N fields. The benchmarks differ in what those fields are and how we initialize them. chained_* initialize the fields by calling N superclass constructors, each of which initializes a single field. merged_* call one big fat constructor. *_final tests initialize final fields, while *_plain initialize plain ones. (A sketch of both shapes follows this list.)

  • A test like this is susceptible to inlining effects: as far as the optimizer is concerned, calling N methods or calling a single one can make a difference. Hence, we have to run this benchmark with -XX:MaxInlineLevel bumped up to inline deeply in the chained_* benchmarks, and -XX:MaxInlineSize bumped up to inline the fat constructors in the merged_* benchmarks.

  • Note this test matrix provides a built-in control: merged_plain and chained_plain should yield the same performance, which will indicate the inlining happened on both sides.
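
For illustration, a minimal sketch of the two constructor shapes (names made up, showing N = 2):

  // chained_*: each superclass constructor initializes a single field,
  // so each <init> potentially gets its own trailing barrier
  class C1 { int f1; C1() { f1 = 42; } }
  class C2 extends C1 { int f2; C2() { f2 = 42; } }

  // merged_*: one constructor initializes all the fields,
  // with a single trailing barrier
  class M2 {
    int f1, f2;
    M2() { f1 = 42; f2 = 42; }
  }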

This section is short, because the control procedures and validation sequence are similar to the first experiment. We did them, but skip the details here.

x86

[Chart: x86, chained initialization data]

On x86, all tests behave similarly (raw data here), as we would expect in the absence of any hardware memory barriers. Compiler barriers do not affect the performance here.

PowerPC

[Chart: PowerPC, chained initialization data]

On weakly-ordered architectures like PowerPC, it starts to get interesting (raw data here):

  • -server data is not available, because we don’t have PPC C2 port handy.

  • The control is passed: plain/* are the same.

  • final/merged emits the final barrier, which costs around the same 10 ns as we saw in the previous experiment.

  • final/chained emits the barriers at each superclass constructor, and hence costs more.

ARM

[Chart: ARM, chained initialization data]

This effect is even more pronounced on ARM (raw data here):

  • The control is passed: plain/* match perfectly.

  • final/merged again emits the final barrier, which costs the same 10 ns.

  • final/chained costs linearly more as we add fields, because the barriers in the superclass constructors are present. With 8 fields, the overhead of these excess barriers takes almost 40% of the allocation cost.

This experiment shows how important it is to coalesce the barriers. We recorded this experiment in JDK-8032218.

Part III. SPECjvm2008

Nano/microbenchmarks are good for coming up with a plausible performance model. However, we need to validate the same changes on larger workloads. We also know that barrier costs might be different in multi-threaded workloads. With this in mind, we validate the changes with SPECjvm2008, which has a broad set of benchmarks, and also saturates the system with benchmark copies.

We were only able to do a single run of SPECjvm2008 per platform.

ARM

[Chart: ARM, SPECjvm2008 data]

Observations (raw data here):

  • In client mode, all the sub-benchmark scores are within a 5% margin of each other, and the overall suite score difference is within 0.01% (-XX:+AlwaysSafeConstructors appears slower). Our experience with SPECjvm2008 suggests very small differences like these can be attributed to measurement noise (more runs are needed to confirm rigorously).

  • In server mode, all the sub-benchmark scores are within a 10% margin of each other, and the overall suite score difference is within 1.5% (-XX:+AlwaysSafeConstructors appears slower). The source of the noise is the scimark subtests, which are known to exhibit large run-to-run variance. More runs will give us more confidence this is not a regression.

x86

[Chart: x86, SPECjvm2008 data]

Observations (raw data here):

  • In client mode, all the sub-benchmark scores are within a 10% margin of each other, and the overall suite score difference is within 1.5% (-XX:+AlwaysSafeConstructors appears slower). The notable outliers are again the scimark subtests, which are known to exhibit large run-to-run variance. More runs will give us more confidence this is not a regression.

  • In server mode, all the sub-benchmark scores are within a 10% margin of each other, and the overall suite score difference is within 1% (-XX:+AlwaysSafeConstructors appears faster). More runs are needed to confirm.

PowerPC

[Chart: PowerPC, SPECjvm2008 data]

Observations (raw data here):

  • xml.validation crashed due to a disk failure; no data there.

  • -server data is missing, since no build is available yet.

  • In client mode, all the sub-benchmark scores are within a 2% margin of each other, and the overall suite score difference is within 0.1% (-XX:+AlwaysSafeConstructors appears slower). Our experience says this is well below the noise level for this benchmark suite. More runs are needed to confirm.

Conclusions and Further Research

So far, we have not identified any show-stopper performance issues in extending the final field memory semantics to all fields. The major impact is on weakly-ordered ARM and PPC, which require memory barriers to enforce the needed semantics. However, the performance difference even on targeted workloads is not devastating. x86 seems immune to enforcing the final field guarantees for all objects. Further research is required on a larger set of non-targeted workloads. This testing will be simpler once we integrate the experimental option into the mainline VM (JDK-8031818).

This simple solution of inserting the barriers into constructors during bytecode parsing inherently relies on barrier coalescing in HotSpot. It is more or less easy to coalesce the object metadata barrier with the final barrier (JDK-8032481), which will eliminate the added cost for a single constructor. The rule of thumb: the cost of a single constructor barrier is around 10 ns, compared with the usual allocation costs of ~30-40 ns on ARM/PPC. The additional overhead of this barrier was not yet observed on larger workloads (TBD: more, more tests).

It is harder to coalesce the barriers across multiple chained constructor calls (JDK-8032218). Targeted tests show linear overheads as the chain of constructors gets deeper, up to the point where 8 simple chained constructors can waste around half of the <init> time executing the barriers. While targeted benchmarks quantify this effect as important, the performance data from large workloads does not indicate this is a problem. (TBD: more, more tests).


1. I am mildly irritated it mentions "compiler" when it really should mention "runtime". Too many people read it to imply that the interpreter is somehow immune from dealing with memory model issues.
2. This and other interesting examples by Hans Boehm, as surfaced during C++11 memory model efforts: http://www.hboehm.info/c++mm/no_write_fences.html
3. "server" mode is frequently used as the alias for C2 compiler, but that is messy since TieredCompilation is involved: both C1 and C2 take place in tiered compilation for "server" mode.
4. Ditto: C1 is the "client" mode compiler.
5. Don’t see the assembly code? That is because Java SE Embedded is not open source, and publishing the generated code disassembly is arguably not fair use. You will have to trust me on what I saw, which admittedly makes the claim unverifiable. Oh, the humanity!