About, Disclaimers, Contacts
"JVM Anatomy Quarks" is an ongoing mini-post series, where every post describes some elementary piece of knowledge about JVM. The name underlines the fact that no single post can be taken in isolation, and most pieces described here readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic errors, or duplicates. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net
Question
I have heard Hotspot can do stack allocation. It is called Escape Analysis, and it is magical. Right?
Theory
This gets a fair bit of confusion. The term "stack allocation" seems to assume that the entire object is allocated on the stack instead of the heap. But what really happens is that the compiler performs the so-called Escape Analysis (EA), which can identify which newly created objects are not escaping into the heap, and then it can do a few interesting optimizations. Note that EA itself is not the optimization; it is the analysis phase that gives important pieces of data to the optimizer.[1]
One of the things the optimizer can do for non-escaping objects is to remap the accesses to the object fields to accesses to synthetic local operands:[2] perform Scalar Replacement. Since those operands are then handled by the register allocator, some of them may claim stack slots (get "spilled") in the current method activation, and it might look like the object field block is allocated on the stack. But this is a false symmetry: the operands may not materialize at all, or may reside in registers; the object header is not created at all; and so on. The operands that get mapped from object field accesses might not even be contiguous on the stack! This is different from stack allocation.
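To make the remapping concrete, here is a hand-written sketch (illustrative code, not actual compiler output; the class and method names are made up) of what Scalar Replacement conceptually turns the method into:

```java
// Hand-written illustration of what Scalar Replacement does conceptually.
// These are ordinary Java methods, not compiler output.
public class ScalarReplacementSketch {
    static class MyObject {
        final int x;
        MyObject(int x) { this.x = x; }
    }

    // What the source code says: allocate, store field, load field, discard.
    static int original(int x) {
        MyObject o = new MyObject(x);
        return o.x;
    }

    // What the optimizer effectively produces once EA proves the object
    // does not escape: the field becomes a synthetic local operand, and
    // the allocation and the object header disappear entirely.
    static int scalarReplaced(int x) {
        int o_x = x;   // synthetic local operand standing in for o.x
        return o_x;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            if (original(i) != scalarReplaced(i)) throw new AssertionError();
        }
        System.out.println("equivalent");
    }
}
```

Note that the "replaced" version has no object left at all; the further store-load elimination described below is exactly what makes this collapse possible.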
If stack allocation were really done, it would allocate the entire object storage on the stack, including the header and the fields, and reference it in the generated code. The caveat in this scheme is that once the object escapes, we would need to copy the entire object block from the stack to the heap, because we cannot be sure the current thread stays in the method and keeps this part of the stack holding the object alive. This means we would have to intercept stores to the heap, in case we ever store a stack-allocated object: that is, do the GC write barrier.
Hotspot does not do stack allocations per se, but it does approximate that with Scalar Replacement.
Can we observe this in practice?
Practice
Consider this JMH benchmark. We create an object with a single field initialized off our input, read that field right away, and discard the object:
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class ScalarReplacement {

    int x;

    @Benchmark
    public int single() {
        MyObject o = new MyObject(x);
        return o.x;
    }

    static class MyObject {
        final int x;
        public MyObject(int x) {
            this.x = x;
        }
    }
}
If you run the test with -prof gc, you would notice it does not allocate anything:
Benchmark Mode Cnt Score Error Units
ScalarReplacement.single avgt 15 1.919 ± 0.002 ns/op
ScalarReplacement.single:·gc.alloc.rate avgt 15 ≈ 10⁻⁴ MB/sec
ScalarReplacement.single:·gc.alloc.rate.norm avgt 15 ≈ 10⁻⁶ B/op
ScalarReplacement.single:·gc.count avgt 15 ≈ 0 counts
-prof perfasm shows there is only a single access to the field x left:
....[Hottest Region 1].............................................................
C2, level 4, org.openjdk.ScalarReplacement::single, version 459 (26 bytes)
[Verified Entry Point]
6.05% 2.82% 0x00007f79e1202900: sub $0x18,%rsp ; prolog
0.95% 0.78% 0x00007f79e1202907: mov %rbp,0x10(%rsp)
0.04% 0.21% 0x00007f79e120290c: mov 0xc(%rsi),%eax ; get field $x
5.80% 7.43% 0x00007f79e120290f: add $0x10,%rsp ; epilog
0x00007f79e1202913: pop %rbp
23.91% 33.34% 0x00007f79e1202914: test %eax,0x17f0b6e6(%rip)
0.21% 0.02% 0x00007f79e120291a: retq
...................................................................................
Notice the magic of it: the compiler was able to detect that the MyObject instance is not escaping, remapped its fields to local operands, and then (drum-roll) identified that the load from that operand immediately follows the store, and eliminated that store-load pair altogether, as it would do with local variables. Then it pruned the allocation, because it is not needed anymore, and every remnant of the object evaporated.
Of course, this requires a sophisticated EA implementation to identify non-escaping candidates. When EA breaks, Scalar Replacement breaks too. The most trivial breakage in the current Hotspot EA is a control flow merge before the access. For example, if we have two different objects (yet with the same content) under a branch that selects either of them, EA breaks, even though both objects are evidently (to us, humans) non-escaping:
public class ScalarReplacement {

    int x;
    boolean flag;

    @Setup(Level.Iteration)
    public void shake() {
        flag = ThreadLocalRandom.current().nextBoolean();
    }

    @Benchmark
    public int split() {
        MyObject o;
        if (flag) {
            o = new MyObject(x);
        } else {
            o = new MyObject(x);
        }
        return o.x;
    }

    // ...
}
Here, the code allocates:
Benchmark Mode Cnt Score Error Units
ScalarReplacement.single avgt 15 1.919 ± 0.002 ns/op
ScalarReplacement.single:·gc.alloc.rate avgt 15 ≈ 10⁻⁴ MB/sec
ScalarReplacement.single:·gc.alloc.rate.norm avgt 15 ≈ 10⁻⁶ B/op
ScalarReplacement.single:·gc.count avgt 15 ≈ 0 counts
ScalarReplacement.split avgt 15 3.781 ± 0.116 ns/op
ScalarReplacement.split:·gc.alloc.rate avgt 15 2691.543 ± 81.183 MB/sec
ScalarReplacement.split:·gc.alloc.rate.norm avgt 15 16.000 ± 0.001 B/op
ScalarReplacement.split:·gc.count avgt 15 1460.000 counts
ScalarReplacement.split:·gc.time avgt 15 929.000 ms
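As an aside (this variant is a sketch, not part of the original benchmark), the merge can be avoided by hand: if each branch reads the field and discards the object before the branches rejoin, no object reference flows through the control flow merge, and EA can again prove each allocation non-escaping:

```java
// Hypothetical rewrite of split(): each object's entire lifetime is
// confined to one branch, so there is no control-flow merge over the
// object reference, and each allocation is independently non-escaping.
public class SplitSketch {
    static class MyObject {
        final int x;
        MyObject(int x) { this.x = x; }
    }

    static int splitReads(boolean flag, int x) {
        if (flag) {
            return new MyObject(x).x;   // object dies within this branch
        } else {
            return new MyObject(x).x;   // object dies within this branch
        }
    }

    public static void main(String[] args) {
        System.out.println(splitReads(true, 5) + " " + splitReads(false, 5));
    }
}
```

Semantically this is the same computation as split(); whether the compiler actually scalar-replaces it should be verified with -prof gc on the JVM at hand.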
If that were a "true" stack allocation, it would trivially handle this case: it would extend the stack at runtime for either allocation, do the accesses, then scratch off the stack contents before leaving the method, and the stack allocation would get retracted. The complication with write barriers that should guard against object escapes still stands.
Observations
Escape analysis is an interesting compiler technique that enables non-trivial optimizations. Scalar Replacement is one of them, and it is not about putting the object storage on the stack. Instead, it is about exploding the object, rewriting the code into local accesses, and optimizing them further, sometimes spilling these accesses onto the stack when register pressure is high. In many cases on critical hotpaths it can be successfully and profitably done.
But EA is not ideal: if we cannot statically determine that the object is not escaping, we have to assume it does. Complicated control flow may make it bail early. Calling non-inlined (and thus opaque for the current analysis) instance methods makes it bail. Some things that rely on object identity make it bail, although trivial things like reference comparison with non-escaping objects get folded efficiently.
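As a concrete illustration of that last point (a sketch, reusing an illustrative MyObject like the one above), a reference comparison between two freshly allocated objects always yields false, so once both objects are proven non-escaping, the compiler can fold the comparison to a constant and drop both allocations:

```java
// Reference comparison with non-escaping objects: foldable to a constant.
public class IdentitySketch {
    static class MyObject {
        final int x;
        MyObject(int x) { this.x = x; }
    }

    // Two fresh allocations are never the same object, so `a == b` is
    // constantly false; with both objects non-escaping, the compiler can
    // fold the comparison and prune both allocations.
    static boolean sameIdentity(int x) {
        MyObject a = new MyObject(x);
        MyObject b = new MyObject(x);
        return a == b;
    }

    public static void main(String[] args) {
        System.out.println(sameIdentity(1)); // prints "false"
    }
}
```

By contrast, operations that publish or depend on a stable identity, such as System.identityHashCode or synchronizing on the object, are the kind of identity uses that can make EA give up.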
This is not an ideal optimization, but when it works, it works magnificently well. Further improvements in compiler technology might widen the number of cases where EA works well.[3]