Faster Atomic*FieldUpdaters for Everyone

Aleksey Shipilёv, @shipilev, aleksey@shipilev.net

This post is also available in ePUB and mobi.

The post uses JMH as the research crucible. If you haven’t learned about it yet, and/or haven’t looked through the JMH samples, I suggest you do that first before reading the rest of this post for the best experience.

This post requires a good understanding of low-level Java and JVM workings. It plows through the basics very quickly. If you want an in-depth explanation on how to approach low level performance work, buckle up and read "Black Magic of Method Dispatch".

Thanks to Nitsan Wakart (@nitsanw), Michael Barker (@mikeb2701), Jonathan Yu (@jawnsy), Paul Sandoz (@PaulSandoz), and others for reviews and helpful suggestions!

Introduction

Although use of the internal sun.* classes has been actively discouraged for a long time, some (especially performance-critical) software has come to depend on sun.misc.Unsafe methods. With the advent of Java 9 and the JEP 260: encapsulation effort, these internal APIs will become inaccessible by default.

There are several facets of the problem, and this post will examine one of them: access to advanced operations on fields of Java objects. There are existing APIs that provide us this functionality, notably AtomicIntegerFieldUpdater, AtomicLongFieldUpdater, and AtomicReferenceFieldUpdater from the java.util.concurrent.atomic package.

In JDK 9, we are working on JEP 193: VarHandles that tries to provide a set of APIs for safe access to the low-level operations on different types of objects. There is a bit of intersection with existing APIs, notably the operations on the existing object fields. It turns out, we may use some of the things we learned while building the VarHandles implementation in the existing classes. Thus, we can provide a partial migration strategy before JDK 9 hits the shelves.

Meet Atomic*FieldUpdaters

If you look at A*FU classes, then you will see the suggested use case for them is as follows:

class MyClass {
  static final AtomicIntegerFieldUpdater<MyClass> UP =
    AtomicIntegerFieldUpdater.newUpdater(MyClass.class, "v");

  volatile int v;

  public int atomicIncrement() {
    return UP.getAndAdd(this, 1);
  }
}

Looking at the implementation internals, you can see that AIFU performs some safety checks before ultimately invoking Unsafe:

public abstract class AtomicIntegerFieldUpdater<T> {

  public static <U> AtomicIntegerFieldUpdater<U> newUpdater(Class<U> tclass,
                                                            String fieldName) {
    return new AtomicIntegerFieldUpdaterImpl<U>
           (tclass, fieldName, Reflection.getCallerClass());
  }

  private static class AtomicIntegerFieldUpdaterImpl<T>
    extends AtomicIntegerFieldUpdater<T> {

    public int getAndAdd(T obj, int delta) {
      if (obj == null || obj.getClass() != tclass || cclass != null)
        fullCheck(obj);
      return unsafe.getAndAddInt(obj, offset, delta);
    }

    private void fullCheck(T obj) {
      if (!tclass.isInstance(obj))
        throw new ClassCastException();
      if (cclass != null)
        ensureProtectedAccess(obj);
    }
  }
}

…where tclass is the field holder class, and cclass is the caller class that instantiated the AIFU to do additional security checks (only populated when the field is protected). The safety pre-checks are present in every business method in every A*FU class.
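To get a feel for what these pre-checks guard against, here is a small sketch that uses only the public API. The class names (UpdaterChecksDemo, Holder, Impostor) are made up for illustration, and the raw-type cast is merely a way to smuggle an object of the wrong class past the compiler:

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

final class UpdaterChecksDemo {
  // Hypothetical field holder, mirroring MyClass above.
  static class Holder {
    volatile int v;
  }

  // Unrelated class that merely has a field of the same shape.
  static class Impostor {
    volatile int v;
  }

  static final AtomicIntegerFieldUpdater<Holder> UP =
    AtomicIntegerFieldUpdater.newUpdater(Holder.class, "v");

  // Normal use: returns the previous value and bumps the field.
  static int increment(Holder h) {
    return UP.getAndAdd(h, 1);
  }

  // Erase the generic type to bypass the compile-time check,
  // so that only the runtime check inside the updater is left.
  @SuppressWarnings({"unchecked", "rawtypes"})
  static boolean rejectsWrongHolder() {
    AtomicIntegerFieldUpdater raw = UP;
    try {
      raw.getAndAdd(new Impostor(), 1);
      return false;
    } catch (ClassCastException expected) {
      return true;
    }
  }

  public static void main(String[] args) {
    Holder h = new Holder();
    increment(h);
    System.out.println("v = " + h.v);
    System.out.println("wrong holder rejected: " + rejectsWrongHolder());
  }
}
```

Without the type check, the underlying Unsafe call would happily write at Holder's field offset inside an unrelated object, which is why the check cannot simply be dropped.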

Step 1. Admit You Have A Problem

Of course, before doing just about anything about a hypothetical performance problem, you have to understand if you actually do have a performance problem ("Hi, I’m Aleksey and I have a performance problem"). First, we have to admit that A*FU classes are used on very busy hotpaths in high-performance code, and sometimes people resort to Unsafe to keep those costs at bay.

Microbenchmarking comes in very handy for research like this, as it allows us to focus on a particular code sample running under specified conditions. Note that benchmarks seldom lie (computers are generally not equipped with a mind to even consider lying), but they do very frequently answer the wrong question, because you screwed up the environmental/benchmarking setup.

To answer "How bad are Atomic*FieldUpdaters, really?", we need to establish the baseline. We have chosen plain field accesses as the baseline, because that might be the fastest way to access memory in Java programs (as you will see later). In hindsight, we can write the benchmark like this:

@State(Scope.Benchmark)
public class AFUBench {
  A a;
  B b;

  @Setup
  public void setup() {
    a = new A();
    b = new B(); // pollute the class hierarchy
  }

  @Benchmark
  public int updater() {
    return a.updater();
  }

  @Benchmark
  public int plain() {
    return a.plain();
  }

  public static class A {
    static final AtomicIntegerFieldUpdater<A> UP
              = AtomicIntegerFieldUpdater.newUpdater(A.class, "v");

    volatile int v;

    public int updater() {
      return UP.get(this);
    }

    public int plain() {
      return v;
    }
  }

  public static class B extends A {}
}

Of course, the actual benchmarks where the issues were discovered are more thorough and check many more cases. We use a simplified example in this post for the sake of better flow: the example above is one of the worst cases for A*FU code, as we will see later. If we run this benchmark on my development desktop (i7-4790K, Linux x86_64) with the latest JDK 8u66, we see this:

Benchmark         Mode  Cnt  Score   Error  Units
AFUBench.plain    avgt   25  1.965 ± 0.001  ns/op
AFUBench.updater  avgt   25  3.007 ± 0.004  ns/op

"This is just one nanosecond", one might say, but these nanoseconds add up on hot paths. The performance difference is much more visible on hardware that cannot speculate heavily. I frequently run the benchmarks on the Atom dev server from my home farm (Atom Z530, Linux i586), as that beast is very sensitive to the generated code quality.

Benchmark         Mode  Cnt   Score   Error  Units
AFUBench.plain    avgt   25  21.436 ± 0.014  ns/op
AFUBench.updater  avgt   25  34.669 ± 0.025  ns/op

Whoa. >1.6x performance difference and it is not a single nanosecond anymore. Bad.

Developing fixes and conducting performance analysis on that hardware is more complicated than hacking directly on my dev desktop, since it entails cross-compiling x86_64 → x86. So we will keep bashing the tests on the very fast x86_64, while paying attention to more observable behaviors, like hardware counters and the generated code.

It helps to quickly characterize the workload to see where the problem might be. JMH provides Linux perf_event bindings with -prof perfnorm, which normalizes the counters per benchmark operation:

Benchmark                                 Mode  Cnt   Score    Error  Units
AFUBench.plain                            avgt   25   1.989 ±  0.034  ns/op
AFUBench.plain:·CPI                       avgt    5   0.318 ±  0.012   #/op
AFUBench.plain:·L1-dcache-load-misses     avgt    5  ≈ 10⁻³            #/op
AFUBench.plain:·L1-dcache-loads           avgt    5  17.368 ±  3.469   #/op
AFUBench.plain:·L1-dcache-store-misses    avgt    5  ≈ 10⁻⁴            #/op
AFUBench.plain:·L1-dcache-stores          avgt    5   4.345 ±  0.874   #/op
AFUBench.plain:·branch-misses             avgt    5  ≈ 10⁻⁴            #/op
AFUBench.plain:·branches                  avgt    5   5.775 ±  1.114   #/op
AFUBench.plain:·cycles                    avgt    5  11.472 ±  2.525   #/op
AFUBench.plain:·instructions              avgt    5  36.073 ±  6.926   #/op

AFUBench.updater                          avgt   25   3.009 ±  0.002  ns/op
AFUBench.updater:·CPI                     avgt    5   0.280 ±  0.002   #/op
AFUBench.updater:·L1-dcache-load-misses   avgt    5   0.001 ±  0.004   #/op
AFUBench.updater:·L1-dcache-loads         avgt    5  24.832 ±  2.255   #/op
AFUBench.updater:·L1-dcache-store-misses  avgt    5  ≈ 10⁻³            #/op
AFUBench.updater:·L1-dcache-stores        avgt    5   5.838 ±  0.514   #/op
AFUBench.updater:·branch-misses           avgt    5  ≈ 10⁻⁴            #/op
AFUBench.updater:·branches                avgt    5   8.775 ±  0.878   #/op
AFUBench.updater:·cycles                  avgt    5  17.587 ±  1.707   #/op
AFUBench.updater:·instructions            avgt    5  62.859 ±  6.344   #/op

This data is abbreviated to show where the problems lurk. There seem to be no ILP problems, as both versions run with CPI ≈ 0.3 clk/insn, which is a good CPI for my Haswell. In fact, the A*FU code has even better ILP. It seems that the A*FU code does more loads, more stores (which includes spilling operands on the stack), more branches, and more instructions. So it does look like we need to shave the excess code off the hot path.

Indeed, if we use PrintAssembly to dump the generated code, we can clearly see the difference. JMH provides handy integration with -prof perfasm, that uses perf to contrast the hot regions in the compiled code. In this example, and in the examples further, we only show the hot benchmark loops, along with JMH benchmark scaffolding (Blackholes, operation counters, termination flags).

This is what the plain scenario looks like:

    LOOP:
 ↗  mov    0x8(%rsp),%r10
 │  mov    0xc(%r10),%r10d        ; get field $a
 │  mov    0xc(%r12,%r10,8),%edx  ; get field $a.v
 │  mov    0x10(%rsp),%rsi        ; prepare and call Blackhole.consume
 │  callq  CONSUME
 │  mov    0x18(%rsp),%r10
 │  movzbl 0x94(%r10),%r10d       ; get field $isDone
 │  add    $0x1,%rbp              ; ops++
 │  test   %eax,0x16d88ff1(%rip)  ; safepoint poll
 │  test   %r10d,%r10d            ; if (!isDone), get back
 ╰  je     LOOP

This is almost the absolute minimum: the majority of the code is the benchmarking infrastructure, with only two instructions as the "business" payload. The updater scenario has much more cruft:

    LOOP:
 ↗  mov    0x10(%rsp),%r10
 │  mov    0xc(%r10),%r10d        ; get field $a
 │  mov    0x8(%r12,%r10,8),%r9d  ; get $a.class
 │  movabs $0x719d45a88,%r11      ; {constant: AIFUImpl instance}
 │  mov    0xc(%r11),%r11d        ; get AIFUImpl.tclass
 │  movabs $0x0,%r8               ; <some magic: Class.isInstance>
 │  lea    (%r8,%r9,8),%r8
 │  mov    0x68(%r8),%r9
 │  mov    %r11,%r8
 │  shl    $0x3,%r8
 │  cmp    %r8,%r9
 │  jne    SLOWPATH_1
 │  movabs $0x719d45a88,%r11      ; {constant: AIFUImpl instance}
 │  mov    0x18(%r11),%r8d        ; get AIFUImpl.cclass
 │  test   %r8d,%r8d              ; null check
 │  jne    SLOWPATH_2             ; if (cclass != null), jump out
 │  mov    %rcx,(%rsp)
 │  mov    0x10(%r11),%r11        ; get AIFUImpl.offset
 │  shl    $0x3,%r10              ; unpack $a reference
 │  mov    (%r10,%r11,1),%edx     ; Unsafe: get field $a@offset
 │  mov    0x18(%rsp),%rsi        ; prepare and call Blackhole.consume
 │  nop
 │  callq  CONSUME
 │  mov    (%rsp),%rcx
 │  movzbl 0x94(%rcx),%r10d       ; get field $isDone
 │  add    $0x1,%rbp              ; ops++
 │  test   %eax,0x181c55ee(%rip)  ; safepoint poll
 │  test   %r10d,%r10d            ; if (!isDone), get back
 ╰  je     LOOP

There is a fair amount of excess code, and we had better handle it step by step. It would be nice to get A*FU performance closer to Unsafe performance, if not to plain field access.

Step 2. Handle Constants

The most obvious thing in the disassembly above is a weird handling of constants. While we know the address of the AIFUImpl object itself, we re-read the fields from there every time. Before attacking this problem, we have to first understand a few things on how optimizers handle final fields.

One could naively presume that a final field value can be used during the compilation, since it is stable and known. The reality begs to differ on two points.

Final Fields and Holder’s Identity

In most cases, the value of a final instance field depends on the field holder’s identity. For example, having several instances of a class is enough to confuse the optimizer:

public class A {
  final boolean v;
  A(boolean pv) { v = pv; }
}

public class ToBe extends A {
  ToBe() { super(true); }
}

public class NotToBe extends A {
  NotToBe() { super(false); }
}

void print() {
  System.out.println(decipher(new ToBe()));
  System.out.println(decipher(new NotToBe()));
}

// Assume this method compiles separately
boolean decipher(A a) {
  return a.v; // ToBe or NotToBe? That's the question.
}

Of course, you can come up with uniqueness analysis, e.g. track if there is exactly one instance of the class, but that will most probably penalize the allocation fastpath, since it would have to consult the uniqueness metadata on each instance allocation. Note that it is different from tracking if there is a unique class in the hierarchy, because class loading is a heavy-weight operation.

The identity analysis is easy when you can establish the identity statically, e.g. if we know that the instance is loaded from a constant, or data flow suggests we always observe a particular instance. For example, once the method decipher(A a) is inlined into its caller in the example above, we know exactly which instance's field we are reading.

Now, with the suggested use patterns for A*FU, we are already avoiding this pitfall. Remember that A*FU instances are supposed to be stored in static final fields, so their instances are constants. This is why the disassembly above has the exact object address in the generated code.

If your A*FU instances are not stored in static final fields, then many fruitful optimizations, especially those described here, go out of the window. Do not deviate from the suggested usage patterns if you don’t understand the full implications of doing that.

VarHandles (as do MethodHandles) require the same use pattern for the ultimate performance. Until we have a better way to track final/stable fields we are mostly stuck rooting an object as a static final, or placing something in the constant pool.
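As a sketch of that use pattern, here is the same counter accessed through both an A*FU and a VarHandle, each rooted in a static final field. This assumes a JDK 9 or later runtime, where JEP 193 shipped as java.lang.invoke.VarHandle; the class names ConstantAccessorsDemo and Counter are made up for illustration:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

final class ConstantAccessorsDemo {
  static class Counter {
    // Both accessors live in static final fields, so the JIT can treat
    // the accessor objects themselves as compile-time constants.
    static final AtomicIntegerFieldUpdater<Counter> UP =
      AtomicIntegerFieldUpdater.newUpdater(Counter.class, "v");

    static final VarHandle VH;

    static {
      try {
        VH = MethodHandles.lookup().findVarHandle(Counter.class, "v", int.class);
      } catch (ReflectiveOperationException e) {
        throw new ExceptionInInitializerError(e);
      }
    }

    volatile int v;

    int incrementWithUpdater() {
      return UP.getAndAdd(this, 1);
    }

    int incrementWithVarHandle() {
      // The cast pins down the return type of the signature-polymorphic call.
      return (int) VH.getAndAdd(this, 1);
    }
  }

  public static void main(String[] args) {
    Counter c = new Counter();
    c.incrementWithUpdater();
    c.incrementWithVarHandle();
    System.out.println("v = " + c.v);
  }
}
```

Storing either accessor in an instance field, or in a non-final static, would still work functionally; it would only take away the constant the optimizer relies on.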

Final Field Writes

There is another problem: damned final field writers (serialization libraries, looking at you). Even if the final field holder is constant, we cannot trust its final fields, since somebody elsewhere can overwrite the value using Reflection or Unsafe.

public class Defcon implements NuclearAttackListener {
  // 1991/12/26: Changed from "volatile" to "final"
  final int level;

  void onMultipleICBMsInbound() {
    // Purify all of our precious bodily fluids:
    reflectionHack_setLevel(this, 1);
  }
}

class WOPR {
  static final Defcon DC = Defcon.connectToNORAD();

  void theOnlyWinningMove() {
    // We trust humankind's future to compiler optimizations
    while (DC.level == 5) {
      blinkLEDs();
      sleep(1);
    }
    nuclearLaunch();
  }
}
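To see that the hazard is not hypothetical, here is a minimal runnable sketch of a reflective write to a final instance field. The names (FinalWriteDemo, reflectionHackSetLevel) are made up for illustration, and the sketch assumes a JDK that still permits such writes after setAccessible(true); newer releases may warn about this or forbid it:

```java
import java.lang.reflect.Field;

final class FinalWriteDemo {
  static class Defcon {
    // Initialized from a constructor argument, so javac cannot
    // treat it as a compile-time constant and inline the value.
    final int level;

    Defcon(int level) {
      this.level = level;
    }
  }

  // Overwrites the final field behind the compiler's back.
  static void reflectionHackSetLevel(Defcon dc, int newLevel) {
    try {
      Field f = Defcon.class.getDeclaredField("level");
      f.setAccessible(true); // also lifts the write protection of the final instance field
      f.setInt(dc, newLevel);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("reflective final write not permitted here", e);
    }
  }

  // Reads the "final" field after somebody else has rewritten it.
  static int levelAfterHack() {
    Defcon dc = new Defcon(5);
    reflectionHackSetLevel(dc, 1);
    return dc.level;
  }

  public static void main(String[] args) {
    System.out.println("level after hack = " + levelAfterHack());
  }
}
```

If the optimizer had folded dc.level into a constant, the read after the hack would still see the old value, and that is exactly the behavioral change at stake.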

This pitfall is harder. While technically folding the final field values is simple, the impact on real world applications is unknown. Therefore, HotSpot has a special experimental flag that turns this folding on, letting users make the judgment call if their applications suffer from final field writes — that flag is -XX:+TrustFinalNonStaticFields. There is experimental work in place to fix this properly.

Using non-standard VM options may affect the semantics of your programs, so take extreme care. At the very least, run all the tests you’ve got before changing any option on production servers.

Indeed, if we run with that flag enabled, our benchmarks and the generated code improve substantially! Unfortunately, we cannot flip the flag on by default, so what’s left to do?

Solution

If we search for the uses of TrustFinalNonStaticFields in HotSpot sources, then we will discover a method that apparently tells if we should trust the final fields for a particular class:

static bool trust_final_non_static_fields(ciInstanceKlass* holder) {
  if (holder == NULL)
    return false;
  if (holder->name() == ciSymbol::java_lang_System())
    // Never trust strangely unstable finals:  System.out, etc.
    return false;
  // Even if general trusting is disabled, trust system-built closures in these packages.
  if (holder->is_in_package("java/lang/invoke") || holder->is_in_package("sun/invoke"))
    return true;
  return TrustFinalNonStaticFields;
}

Excellente! Now, can we add the exceptions for A*FU classes here as well? From a maintainer's perspective, that is a questionable move, as it contaminates the VM code with knowledge about non-special class library classes.^[1] But, in this particular case, the observable benefits of unhooking users from Unsafe dominate. Therefore, we made this simple change, JDK-8140483, which adds the known A*FU subclasses to the list.

Let’s see how it performs now:

Benchmark         Mode  Cnt  Score   Error  Units
AFUBench.plain    avgt   25  1.988 ± 0.039  ns/op
AFUBench.updater  avgt   25  2.419 ± 0.032  ns/op  (was: 3.007 ± 0.004  ns/op)

Oh, it seems much better, but is this the real deal? Looking at the generated code for updater tests reveals that we now fold the constant values nicely, making many computations redundant, and hence eliminated by the optimizer:

    LOOP:
 ↗  mov    0x10(%rsp),%r10
 │  mov    0xc(%r10),%r11d         ; get field $a
 │  mov    0x8(%r12,%r11,8),%r10d  ; get $a.class
 │  cmp    $0xf801235a,%r10d       ;  {metadata(org/openjdk/AFUBench$A)}
 │  jne    OUT_SLOWPATH            ; if ($a.class != AFUBench$A.class), jump out
 │  lea    (%r12,%r11,8),%r10      ; unpack $a reference
 │  mov    0xc(%r10),%edx          ; get $a@(offset=0xc)
 │  mov    0x18(%rsp),%rsi         ; prepare and call Blackhole.consume
 │  data16 xchg %ax,%ax
 │  callq  CONSUME
 │  mov    (%rsp),%r10
 │  movzbl 0x94(%r10),%r10d        ; get field $isDone
 │  mov    0x20(%rsp),%r11         ; spill in $ops
 │  add    $0x1,%r11               ; $ops++
 │  mov    %r11,0x20(%rsp)         ; spill out $ops
 │  test   %eax,0x181d9690(%rip)   ; safepoint poll
 │  test   %r10d,%r10d             ; if (!isDone), jump back
 ╰  je     LOOP

Step 3. Handle Type Checks

While the generated code looks better, we are still not there. The apparent trouble, as we can tell by comparing with the plain disassembly, is that we are still doing the type check against the desired target class. Interestingly, the current code performs a "quick" direct class check, with a fallback to Class.isInstance in fullCheck() if necessary:

private static class AtomicIntegerFieldUpdaterImpl<T>
  extends AtomicIntegerFieldUpdater<T> {

  public final int get(T obj) {
    if (... || obj.getClass() != tclass || ...) fullCheck(obj);
    ...
  }

  private void fullCheck(T obj) {
    if (!tclass.isInstance(obj))
      throw new ClassCastException();
    ...
  }
}

But is it really worth it?

Class.isInstance Performance

It turns out the pre-check is not worth it, because Class.isInstance is intrinsified by compilers, and the optimizers are good enough in dealing with it; even better, it also checks for subclasses. That means, in our A*FU use cases, optimizers are able to figure out that any this from inside the relevant class would always be accepted by the isInstance check, and optimizers can fold it. This is not the case for getClass-style direct checks.
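The semantic difference between the two checks is easy to observe even without a benchmark. The sketch below uses made-up names (TypeCheckDemo, viaUpdater) and shows that the direct class comparison fails for a subclass receiver, while isInstance, and the updater itself, accept it:

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

final class TypeCheckDemo {
  static class A {
    static final AtomicIntegerFieldUpdater<A> UP =
      AtomicIntegerFieldUpdater.newUpdater(A.class, "v");

    volatile int v = 42;

    // Exact-class test: true only when the receiver is precisely an A.
    boolean direct() {
      return A.class == this.getClass();
    }

    // Subtype-aware test: true for A and for every subclass of A.
    boolean isInstance() {
      return A.class.isInstance(this);
    }

    // The updater must accept subclass receivers as well.
    int viaUpdater() {
      return UP.get(this);
    }
  }

  static class B extends A {}

  public static void main(String[] args) {
    A a = new A();
    A b = new B();
    System.out.println("direct:     a=" + a.direct() + ", b=" + b.direct());
    System.out.println("isInstance: a=" + a.isInstance() + ", b=" + b.isInstance());
    System.out.println("updater on B: " + b.viaUpdater());
  }
}
```

Since the direct check can legitimately fail inside A's own methods, the compiler has to keep it; the isInstance answer is the same for every possible receiver, which is what makes it foldable.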

For example, the targeted benchmark like this:

@State(Scope.Benchmark)
public class ClassIsInstance {

    A a1;
    A a2;
    B b;

    @Setup
    public void setup() {
        a1 = new A();
        a2 = new B();
         b = new B();
    }

    @Benchmark
    public boolean direct_a1()       { return a1.direct(); }

    @Benchmark
    public boolean direct_a2()       { return a2.direct(); }

    @Benchmark
    public boolean direct_b()        { return b.direct(); }

    @Benchmark
    public boolean isInstance_a1()   { return a1.isInstance(); }

    @Benchmark
    public boolean isInstance_a2()   { return a2.isInstance(); }

    @Benchmark
    public boolean isInstance_b()    { return b.isInstance(); }

    private static class A {
        public boolean isInstance() {
            return A.class.isInstance(this);
        }

        public boolean direct() {
            return A.class == this.getClass();
        }
    }

    private static class B extends A {}
}

The non-obvious caveat is, this has the runtime type of the receiver: if we call a2.direct(), then this.getClass() == B.class. This is why, when A.direct() compiles, optimizers cannot fold the type check: it is plausible to call this method with both A and B receivers. But isInstance is a more encompassing test, and it will always return true. Let’s check anyway:

Benchmark                      Mode  Cnt  Score   Error  Units
ClassIsInstance.direct_a1      avgt   25  2.269 ± 0.047  ns/op  // !
ClassIsInstance.direct_a2      avgt   25  2.430 ± 0.349  ns/op  // !
ClassIsInstance.direct_b       avgt   25  2.062 ± 0.059  ns/op
ClassIsInstance.isInstance_a1  avgt   25  2.051 ± 0.002  ns/op
ClassIsInstance.isInstance_a2  avgt   25  2.058 ± 0.028  ns/op
ClassIsInstance.isInstance_b   avgt   25  2.050 ± 0.007  ns/op

Studying the generated code confirms the hypothesis. This is exactly what we see in A*FU cases. Therefore, pulling the tclass.isInstance check out of fullCheck and into the fast path lets the optimizer remove the type check, e.g.:

$ diff src/share/classes/java/util/concurrent/atomic/AtomicIntegerFieldUpdater.java
@@ -456,7 +456,7 @@
         }

         public final int get(T obj) {
-            if (obj == null || obj.getClass() != tclass || cclass != null) fullCheck(obj);
+            if (obj == null || !tclass.isInstance(obj) || cclass != null) fullCheck(obj);
             return unsafe.getIntVolatile(obj, offset);
         }

That works well if we inlined the entire method tree, so that dataflow can actually collapse the computation. If you don’t inline AIFU.get(this), or any other method that has to pass this to class.isInstance, you are screwed. With that, we are coming to another pitfall:

C1 Inlining Strategy

Doing the Class.isInstance change blindly highlights another interesting case: it regresses C1!^[2] While there is a fair amount of controversy about whether you want to optimize for C1 or not, when a very ubiquitous code path is concerned, it is much better to demonstrate that the code change does not regress performance with simpler compilers. That is not the case for our naive change:

Benchmark         Mode  Cnt  Score   Error  Units

# C1 (-XX:TieredStopAtLevel=1) with a direct class check
AFUBench.plain    avgt   25  2.523 ± 0.013  ns/op
AFUBench.updater  avgt   25  5.148 ± 0.075  ns/op

# C1 (-XX:TieredStopAtLevel=1) with Class.isInstance
AFUBench.plain    avgt    5  2.528 ± 0.075  ns/op
AFUBench.updater  avgt    5  8.187 ± 0.154  ns/op  (Bad, bad, bad)

If you explore this, you will notice that C1 does not inline some A*FU methods for a simple reason: it does not have profile information, and therefore relies on a much lower method size threshold (MaxInlineSize = 35 bytes), whereas C2 cuts off hot method inlining at a much larger threshold (FreqInlineSize = 325 bytes).^[3]

Once inlining breaks, constant propagation described above breaks. Once we inline everything down the A*FU methods, C1 and C2 work nicely with Class.isInstance change.

Solution

The solution is to both do the Class.isInstance check early, and peel the methods while rewiring the code, to aid static inlining. That is the end result of JDK-8140587, as seen in the actual code change, that is a more thorough form of the earlier proof-of-concept patch:

$ diff src/java.base/share/classes/java/util/concurrent/atomic/AtomicIntegerFieldUpdater.java
@@ -429,8 +429,20 @@
         }

         private void fullCheck(T obj) {
+            typeCheck(obj);
+            accessCheck(obj);
+        }
+
+        static void throwCCE() {
+            throw new ClassCastException();
+        }
+
+        private void typeCheck(T obj) {
             if (!tclass.isInstance(obj))
-                throw new ClassCastException();
+                throwCCE();
+        }
+
+        private void accessCheck(T obj) {
             if (cclass != null)
                 ensureProtectedAccess(obj);
         }
@@ -456,7 +468,8 @@
         }

         public final int get(T obj) {
-            if (obj == null || obj.getClass() != tclass || cclass != null) fullCheck(obj);
+            typeCheck(obj);
+            accessCheck(obj);
             return U.getIntVolatile(obj, offset);
         }
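The peeling idiom itself is not specific to the JDK. Here is a stand-alone sketch with made-up names (PeelingDemo, typeCheckInline) that contrasts a check with the exception construction moved into a separate static method against the same check with the throw written inline:

```java
final class PeelingDemo {
  private final Class<?> tclass;

  PeelingDemo(Class<?> tclass) {
    this.tclass = tclass;
  }

  // Cold path: allocating and throwing the exception lives in its own method.
  static void throwCCE() {
    throw new ClassCastException();
  }

  // Hot path: only the test and a call remain in the method body,
  // keeping its bytecode small for size-based inlining heuristics.
  void typeCheck(Object obj) {
    if (!tclass.isInstance(obj))
      throwCCE();
  }

  // For comparison: the same check with the throw inlined by hand.
  void typeCheckInline(Object obj) {
    if (!tclass.isInstance(obj))
      throw new ClassCastException();
  }

  public static void main(String[] args) {
    PeelingDemo d = new PeelingDemo(String.class);
    d.typeCheck("accepted");
    try {
      d.typeCheck(42);
    } catch (ClassCastException e) {
      System.out.println("rejected non-String argument");
    }
  }
}
```

Both variants behave identically; the difference is only in method body size, which can be compared by running javap -c on the compiled class.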

Let’s see if that helps:

Benchmark         Mode  Cnt  Score   Error  Units

# C2 (default)
AFUBench.plain    avgt   25  1.975 ± 0.020  ns/op
AFUBench.updater  avgt   25  2.046 ± 0.020  ns/op  (was: 2.419 ± 0.032 ns/op)

# C1 (-XX:TieredStopAtLevel=1)
AFUBench.plain    avgt   25  2.513 ± 0.004  ns/op
AFUBench.updater  avgt   25  3.546 ± 0.067  ns/op  (was: 5.148 ± 0.075 ns/op)

Oh yes, it does! Let’s look at C2 disassembly, and realize we are very close to the plain field performance, with almost no cruft left:

    LOOP:
 ↗  mov    0x8(%rsp),%r10
 │  mov    0xc(%r10),%r11d         ; get field $a
 │  test   %r11d,%r11d             ; null-check $a
 │  je     OUT_TO_NPE              ; if ($a == null), jump to throwing NPE
 │  mov    0xc(%r12,%r11,8),%edx   ; unpack and get $a@(offset=0xc)
 │  mov    0x10(%rsp),%rsi         ; prepare and call Blackhole.consume
 │  data16 xchg %ax,%ax
 │  callq  CONSUME
 │  mov    0x18(%rsp),%r10
 │  movzbl 0x94(%r10),%r10d        ; get field $isDone
 │  add    $0x1,%rbp               ; $ops++
 │  test   %eax,0x17a7e0d9(%rip)   ; safepoint poll
 │  test   %r10d,%r10d             ; if (!isDone), jump back
 ╰  je     LOOP

In fact, running this test on the aforementioned Atom server says we are very close indeed, even on not very sophisticated hardware:

Benchmark         Mode  Cnt   Score   Error  Units
AFUBench.plain    avgt   25  21.459 ± 0.038  ns/op
AFUBench.updater  avgt   25  21.845 ± 0.025  ns/op (was: 34.669 ± 0.025 ns/op)

Step 4. Handle Null Checks?

The only thing left is to deal with the null check, because that’s the only difference against the plain field access. But hold on, why doesn’t a plain field access perform a null check? It should, because the receiver can be null, and we should be able to throw NullPointerException there:

public class AFUBench {
  ...

  A a;

  @Setup
  public void setup() {
    a = new A();
  }

  @Benchmark
  public int plain() {
    return a.plain();
  }

  ...
}

The hint lies in the non-abbreviated disassembly, that has a helpful comment:

 mov 0xc(%r10),%r10d       ;*getfield a
                           ; - org.openjdk.AFUBench::plain@1 (line 32)
 mov 0xc(%r12,%r10,8),%edx ;*getfield v
                           ; - org.openjdk.AFUBench$A::plain@1 (line 46)
       LOOK HERE --------> ; implicit exception: dispatches to 0x00007f40807af465

This is a trap-based implicit null check. If the receiver instance is null, then the machine would access the zero page and trigger the good ol' SEGV signal, which the JVM can intercept, look at the return address, and figure out that a null check had failed at the particular place in the generated code.

Thy Name is Implicit

One can say, "Look, obviously, doing a.v generates the implicit null check; but your A*FU code has the explicit check!" This is the wrong way to think about this. While the name is "implicit", compilers routinely subsume the explicit checks, as demonstrated by this simple test:

@State(Scope.Benchmark)
public class ImplicitNullCheck {

  Target t;

  @Setup
  public void setup() {
    t = new Target();
  }

  @Benchmark
  public int plain() {
    return t.f;
  }

  @Benchmark
  public int plain_checked_local() {
    Target v = this.t;
    if (v == null) {
      throw new NullPointerException();
    }
    return v.f;
  }

  private static class Target {
    public volatile int f;
  }
}

These yield the same score:

Benchmark                              Mode  Cnt  Score   Error  Units
ImplicitNullCheck.plain                avgt   25  1.983 ± 0.029  ns/op
ImplicitNullCheck.plain_checked_local  avgt   25  1.993 ± 0.050  ns/op

…which is not surprising, because they yield exactly the same machine code. See JDK-8144717 for the disassembly.
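The reason the compiler is allowed to merge the two is that they are observably identical. A small sketch with made-up names (NullCheckSemanticsDemo, throwsNpe) that checks this without any benchmarking:

```java
import java.util.function.ToIntFunction;

final class NullCheckSemanticsDemo {
  static class Target {
    volatile int f = 7;
  }

  // Relies on the null check implied by the field access.
  static int plain(Target t) {
    return t.f;
  }

  // Spells the null check out explicitly before the access.
  static int checked(Target t) {
    if (t == null) {
      throw new NullPointerException();
    }
    return t.f;
  }

  // Feeds null to the reader and reports whether it threw an NPE.
  static boolean throwsNpe(ToIntFunction<Target> reader) {
    try {
      reader.applyAsInt(null);
      return false;
    } catch (NullPointerException expected) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("plain throws NPE:   " + throwsNpe(NullCheckSemanticsDemo::plain));
    System.out.println("checked throws NPE: " + throwsNpe(NullCheckSemanticsDemo::checked));
  }
}
```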

Unsafe != SuperFast

The real trouble is the Unsafe access itself. If you add another test in the benchmark above:

static final long OFF;
static final Unsafe U;

static {
  try {
    Field field = Unsafe.class.getDeclaredField("theUnsafe");
    field.setAccessible(true);
    U = (Unsafe) field.get(null);
    OFF = U.objectFieldOffset(Target.class.getDeclaredField("f"));
  } catch (Exception e) {
    throw new AssertionError(e);
  }
}

@Benchmark
public int unsafe_checked_local() {
  Target v = this.t;
  if (v == null) {
    throw new NullPointerException();
  }
  return U.getIntVolatile(v, OFF);
}

…then it will be penalized in the exact same way our A*FU case is penalized:

Benchmark                               Mode  Cnt  Score   Error  Units
ImplicitNullCheck.plain                 avgt   25  1.966 ± 0.007  ns/op
ImplicitNullCheck.plain_checked_local   avgt   25  1.977 ± 0.026  ns/op
ImplicitNullCheck.unsafe_checked_local  avgt   25  2.050 ± 0.038  ns/op

In this particular example, the compiler intentionally prevents code motion around that Unsafe access in order to keep the IR invariants sane, and that breaks subsuming the null check:

  // Memory barrier to prevent normal and 'unsafe' accesses from
  // bypassing each other.  Happens after null checks, so the
  // exception paths do not take memory state from the memory barrier,
  // so there's no problems making a strong assert about mixing users
  // of safe & unsafe memory.
  if (need_mem_bar) insert_mem_bar(Op_MemBarCPUOrder);

Whether it can be fixed without messing up the compiler, is a question for JDK-8144717. This is yet another counter-example against the permeating myth that Unsafe means SuperFast — it doesn’t. Unsafe has its own performance caveats ^[4], and avoiding them requires much more in-depth JVM knowledge than many Java users have. This is another reason why we need to keep it isolated within the public API wrappers like j.u.c.atomics or upcoming VarHandles.

Solution

The solution is to back off slowly, because the only issue left is endemic to Unsafe itself, and it should be solved there. In fact, we would get the same generated code with the Unsafe access, or any other solution involving Unsafe, be it a custom class, VarHandles, etc. The "methodological problem" here is that we have chosen a very aggressive baseline: plain fields. We have already hit the "Be at least as fast as Unsafe" target.

Conclusion and Parting Thoughts

With a few simple tricks, you can get the Atomic*FieldUpdater performance up to the performance of the naked Unsafe call. While that may be not exactly the cost of doing the plain field access, it still provides a clear-cut way to rewrite at least some "wild" Unsafe usages back to public APIs. These tricks are platform-independent, and method-independent, as they optimize out the check paths through which all A*FU methods go.

The fixes are already available in JDK 9 (by the time you read this, probably even in the weekly EA builds), and, if everything plays out fine, the backports should be available in the next JDK 8u76 (1, 2, 3).

These tricks, while requiring some exposure to low-level performance engineering, are not rocket science. The approaches to benchmarking, analyzing, and optimizing these simple path length cases are well understood. The discussions leading to follow-ups and more testing are held in the open, and the development work is supported by many bright people like Doug Lea, Martin Buchholz, Paul Sandoz, John Rose, and many others.

What baffles me is that most experienced people who did Unsafe tricks for performance would have had a better impact on the world by spending their time on fixing the underlying problems in the JDK code, rather than maintaining workarounds that also lock them into a private API. I think I spent more time writing this post than fixing the issues themselves. This is literally something you can do over a few weekends.

A vast majority of performance issues are easily fixable, provided you look for the fixes, no matter how complicated the codebase appears to be. Of course, when one starts to hit compiler optimizations requiring more extensive C1/C2 knowledge, it becomes harder to fix, but it is still often easy to identify the cause with careful benchmarking and data, which almost certainly helps those with more knowledge fix the problem more quickly.

If you really care about this, it would be a good idea to participate in VarHandles development and testing. So far only a few people did, which I can say is a very small fraction of those who cried about Unsafe on Twitter, Reddit, and HackerNews.

1. @Stable annotation would come in handy for this kind of hackery, but it would not be available even within the class library in JDK 8.

2. Or, in another (misnomer) classification, the -client compiler. It runs in current JVMs anyhow as part of Tiered Compilation. The final tier is handled by C2, or -server compiler.

3. Granted, this may be solved by other means. E.g. revisiting the C1 inlining policy, but that’s the fight you don’t want to fight, as it has surprising consequences on compile time and time-to-performance. Or, using @ForceInline, but only after it arrives in a package where A*FU can actually use it, see JDK-8144223. But, this would not be available in JDK 8, because no module boundaries are protecting these annotations from leaking outside the core library.

4. My recent favorite is JDK-8074124: "Most Unsafe.get*() access shapes are losing vs. the plain Java accesses".