About, Disclaimers, Contacts
"JVM Anatomy Quarks" is an ongoing mini-post series, where every post describes some elementary piece of knowledge about JVM. The name underlines the fact that no single post can be taken in isolation, and most pieces described here readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic errors, or duplicates. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net
Question
I have heard Hotspot can do stack allocation. It is called Escape Analysis, and it is magical. Right?
Theory
This gets a fair bit of confusion. The term "stack allocation" seems to assume that the entire object is allocated on the stack instead of the heap. But what really happens is that the compiler performs the so-called Escape Analysis (EA), which can identify which newly created objects are not escaping into the heap, and then it can do a few interesting optimizations. Note that EA itself is not the optimization; it is the analysis phase that gives important pieces of data to the optimizer.[1]
One of the things the optimizer can do for non-escaping objects is to remap the accesses to the object fields to accesses to synthetic local operands:[2] perform Scalar Replacement. Since those operands are then handled by the register allocator, some of them may claim stack slots (get "spilled") in the current method activation, and it might look like the object field block is allocated on the stack. But this is a false symmetry: the operands may not materialize at all, or may reside in registers; the object header is not created at all; and so on. The operands that get mapped from object field accesses might not even be contiguous on the stack! This is different from stack allocation.
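To make the remapping concrete, here is a hand-written sketch (illustrative code, not actual compiler output; the class and method names are made up) of what Scalar Replacement conceptually turns the method into:

```java
// Hand-written illustration of what Scalar Replacement does conceptually.
// These are ordinary Java methods, not compiler output.
public class ScalarReplacementSketch {
    static class MyObject {
        final int x;
        MyObject(int x) { this.x = x; }
    }

    // What the source code says: allocate, store field, load field, discard.
    static int original(int x) {
        MyObject o = new MyObject(x);
        return o.x;
    }

    // What the optimizer effectively produces once EA proves the object
    // does not escape: the field becomes a synthetic local operand, and
    // the allocation and the object header disappear entirely.
    static int scalarReplaced(int x) {
        int o_x = x;   // synthetic local operand standing in for o.x
        return o_x;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            if (original(i) != scalarReplaced(i)) throw new AssertionError();
        }
        System.out.println("equivalent");
    }
}
```

Note that the "replaced" version has no object left at all; the further store-load elimination described below is exactly what makes this collapse possible.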
If stack allocation were really done, it would allocate the entire object storage on the stack, including the header and the fields, and reference it in the generated code. The caveat in this scheme is that once the object escapes, we would need to copy the entire object block from the stack to the heap, because we cannot be sure the current thread stays in the method and keeps this part of the stack holding the object alive. This means we would have to intercept stores to the heap, in case we ever store a stack-allocated object: that is, do the GC write barrier.
Hotspot does not do stack allocations per se, but it does approximate that with Scalar Replacement.
Can we observe this in practice?
Practice
Consider this JMH benchmark. We create an object with a single field initialized off our input, read that field right away, and discard the object:
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class ScalarReplacement {

    int x;

    @Benchmark
    public int single() {
        MyObject o = new MyObject(x);
        return o.x;
    }

    static class MyObject {
        final int x;
        public MyObject(int x) {
            this.x = x;
        }
    }
}
If you run the test with -prof gc, you would notice it does not allocate anything:
Benchmark Mode Cnt Score Error Units
ScalarReplacement.single avgt 15 1.919 ± 0.002 ns/op
ScalarReplacement.single:·gc.alloc.rate avgt 15 ≈ 10⁻⁴ MB/sec
ScalarReplacement.single:·gc.alloc.rate.norm avgt 15 ≈ 10⁻⁶ B/op
ScalarReplacement.single:·gc.count avgt 15 ≈ 0 counts
-prof perfasm shows there is only a single access to the field x left:
....[Hottest Region 1].............................................................
C2, level 4, org.openjdk.ScalarReplacement::single, version 459 (26 bytes)
[Verified Entry Point]
6.05% 2.82% 0x00007f79e1202900: sub $0x18,%rsp ; prolog
0.95% 0.78% 0x00007f79e1202907: mov %rbp,0x10(%rsp)
0.04% 0.21% 0x00007f79e120290c: mov 0xc(%rsi),%eax ; get field $x
5.80% 7.43% 0x00007f79e120290f: add $0x10,%rsp ; epilog
0x00007f79e1202913: pop %rbp
23.91% 33.34% 0x00007f79e1202914: test %eax,0x17f0b6e6(%rip)
0.21% 0.02% 0x00007f79e120291a: retq
...................................................................................
Notice the magic of it: the compiler was able to detect that the MyObject instance is not escaping, remapped its fields to local operands, and then (drum-roll) identified that the load from that operand immediately follows the store, and eliminated that store-load pair altogether, as it would do with local variables. Then it pruned the allocation, because it is not needed anymore, and every remnant of the object evaporated.
Of course, this requires a sophisticated EA implementation to identify non-escaping candidates. When EA breaks, Scalar Replacement breaks too. The most trivial breakage in the current Hotspot EA is a control flow merge before the access. For example, if we have two different objects (yet with the same content) under a branch that selects either of them, EA breaks, even though both objects are evidently (to us, humans) non-escaping:
public class ScalarReplacement {

    int x;
    boolean flag;

    @Setup(Level.Iteration)
    public void shake() {
        flag = ThreadLocalRandom.current().nextBoolean();
    }

    @Benchmark
    public int split() {
        MyObject o;
        if (flag) {
            o = new MyObject(x);
        } else {
            o = new MyObject(x);
        }
        return o.x;
    }

    // ...
}
Here, the code allocates:
Benchmark Mode Cnt Score Error Units
ScalarReplacement.single avgt 15 1.919 ± 0.002 ns/op
ScalarReplacement.single:·gc.alloc.rate avgt 15 ≈ 10⁻⁴ MB/sec
ScalarReplacement.single:·gc.alloc.rate.norm avgt 15 ≈ 10⁻⁶ B/op
ScalarReplacement.single:·gc.count avgt 15 ≈ 0 counts
ScalarReplacement.split avgt 15 3.781 ± 0.116 ns/op
ScalarReplacement.split:·gc.alloc.rate avgt 15 2691.543 ± 81.183 MB/sec
ScalarReplacement.split:·gc.alloc.rate.norm avgt 15 16.000 ± 0.001 B/op
ScalarReplacement.split:·gc.count avgt 15 1460.000 counts
ScalarReplacement.split:·gc.time avgt 15 929.000 ms
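As an aside (this variant is a sketch, not part of the original benchmark), the merge can be avoided by hand: if each branch reads the field and discards the object before the branches rejoin, no object reference flows through the control flow merge, and EA can again prove each allocation non-escaping:

```java
// Hypothetical rewrite of split(): each object's entire lifetime is
// confined to one branch, so there is no control-flow merge over the
// object reference, and each allocation is independently non-escaping.
public class SplitSketch {
    static class MyObject {
        final int x;
        MyObject(int x) { this.x = x; }
    }

    static int splitReads(boolean flag, int x) {
        if (flag) {
            return new MyObject(x).x;   // object dies within this branch
        } else {
            return new MyObject(x).x;   // object dies within this branch
        }
    }

    public static void main(String[] args) {
        System.out.println(splitReads(true, 5) + " " + splitReads(false, 5));
    }
}
```

Semantically this is the same computation as split(); whether the compiler actually scalar-replaces it should be verified with -prof gc on the JVM at hand.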
If that were a "true" stack allocation, it would trivially handle this case: it would extend the stack at runtime for either allocation, do the accesses, then scratch off the stack contents before leaving the method, and the stack allocation would get retracted. The complication with write barriers that should guard against object escapes still stands.
Observations
Escape analysis is an interesting compiler technique that enables non-trivial optimizations. Scalar Replacement is one of them, and it is not about putting the object storage on the stack. Instead, it is about exploding the object, rewriting the code into local accesses, and optimizing them further, sometimes spilling these accesses onto the stack when register pressure is high. In many cases on critical hotpaths it can be successfully and profitably done.
But EA is not ideal: if we cannot statically determine that the object is not escaping, we have to assume it does. Complicated control flow may make it bail early. Calling non-inlined (and thus opaque for the current analysis) instance methods makes it bail. Some things that rely on object identity make it bail, although trivial things like reference comparison with non-escaping objects get folded efficiently.
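As a concrete illustration of that last point (a sketch, reusing an illustrative MyObject like the one above), a reference comparison between two freshly allocated objects always yields false, so once both objects are proven non-escaping, the compiler can fold the comparison to a constant and drop both allocations:

```java
// Reference comparison with non-escaping objects: foldable to a constant.
public class IdentitySketch {
    static class MyObject {
        final int x;
        MyObject(int x) { this.x = x; }
    }

    // Two fresh allocations are never the same object, so `a == b` is
    // constantly false; with both objects non-escaping, the compiler can
    // fold the comparison and prune both allocations.
    static boolean sameIdentity(int x) {
        MyObject a = new MyObject(x);
        MyObject b = new MyObject(x);
        return a == b;
    }

    public static void main(String[] args) {
        System.out.println(sameIdentity(1)); // prints "false"
    }
}
```

By contrast, operations that publish or depend on a stable identity, such as System.identityHashCode or synchronizing on the object, are the kind of identity uses that can make EA give up.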
This is not an ideal optimization, but when it works, it works magnificently well. Further improvements in compiler technology might widen the number of cases where EA works well.[3]