JVM Anatomy Quark #6: New Object Stages

About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep for only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing 'tyle, syntaxtic and semantically errors, duplicates, or also consistency. Use and/or trust this at your own risk.

Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net

Question

So I’ve heard allocation is not initialization. But Java has constructors! Are they allocating? Or initializing?

Theory

If you open the GC Handbook, it will tell you that creating a new object usually entails three phases:

1. Allocation. That is, figuring out what part of the process space to use for the instance data.
2. System initialization. That is, the initialization required by the language. In C, no initialization is required for newly allocated objects. In Java, system initialization is required for all objects: a newly created object is expected to show only default values in its fields, to have all headers intact, etc.
3. Secondary (user) initialization. That is, running whatever instance initializers and constructors are associated with that object type.
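The last two stages can be observed from plain Java. The sketch below is only an illustration: the class InitStages, the LOG list and the observe() helper are hypothetical names, not part of the original experiment. It records the value of x as seen by the instance initializer, by the constructor body, and after new returns.

```java
import java.util.ArrayList;
import java.util.List;

public class InitStages {
    // Hypothetical log that records what each stage observes.
    static final List<String> LOG = new ArrayList<>();

    static class Point {
        int x; // no field initializer: only system initialization has set it

        {
            // Instance initializer: part of user initialization, runs first.
            observe("instance initializer");
            x = 1;
        }

        Point(int x) {
            // Constructor body: runs after the instance initializer.
            observe("constructor body");
            this.x = x;
        }

        void observe(String stage) {
            LOG.add(stage + ": x=" + this.x);
        }
    }

    static List<String> run() {
        LOG.clear();
        Point p = new Point(42);
        LOG.add("after new: x=" + p.x);
        return new ArrayList<>(LOG);
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The instance initializer sees x=0, which is the default value guaranteed by system initialization. The constructor body sees x=1, and the caller sees x=42 once user initialization has finished.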

We have covered (1) in a previous note, "TLAB Allocation". It is time to see what initialization is doing. If you are familiar with Java bytecode, then you will know that the "new" code path takes several bytecode instructions. For example, this:

public Object t() {
  return new Object();
}

…compiles into:

  public java.lang.Object t();
    descriptor: ()Ljava/lang/Object;
    flags: (0x0001) ACC_PUBLIC
    Code:
      stack=2, locals=1, args_size=1
         0: new           #4                  // class java/lang/Object
         3: dup
         4: invokespecial #1                  // Method java/lang/Object."<init>":()V
         7: areturn

It feels like new is doing allocation and system initialization, while invoking the constructor (<init>) does user initialization. But smart runtimes can coalesce the initialization stages when they know nobody can call the bluff, for example by observing the object before the constructor has finished. Can we show whether this initialization subsuming works in HotSpot?

Experiment

Sure we can. To do this, we just need a test that initializes two variants of a single-int-field class:

import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class UserInit {

    @Benchmark
    public Object init() {
        return new Init(42);
    }

    @Benchmark
    public Object initLeaky() {
        return new InitLeaky(42);
    }

    static class Init {
        private int x;
        public Init(int x) {
            this.x = x;
        }
    }

    static class InitLeaky {
        private int x;
        public InitLeaky(int x) {
            doSomething();
            this.x = x;
        }

        @CompilerControl(CompilerControl.Mode.DONT_INLINE)
        void doSomething() {
            // intentionally left blank
        }
    }
}

This test is specially crafted to forbid inlining of the empty doSomething(), forcing the optimizers to assume that something downstream may access x. In other words, the call effectively leaks the object to some external code, because the optimizers cannot tell whether the code in doSomething() actually leaks it.
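To see why the optimizers have to be this careful, consider a variant where doSomething() is not empty. This is a minimal sketch, not part of the benchmark: the class LeakyObserver and the static fields leaked and seenDuringConstruction are hypothetical names introduced here for illustration. The method publishes this and reads x before the constructor has stored the user value.

```java
public class LeakyObserver {
    // Hypothetical stand-ins for "some external code" that got hold of the object.
    static InitLeaky leaked;
    static int seenDuringConstruction = -1;

    static class InitLeaky {
        private int x;

        InitLeaky(int x) {
            doSomething(); // 'this' escapes before the field store below
            this.x = x;
        }

        void doSomething() {
            leaked = this;                   // publish the half-constructed object
            seenDuringConstruction = this.x; // the language requires the default value here
        }
    }

    // Returns { value of x seen inside doSomething(), value of x after construction }.
    static int[] run() {
        new InitLeaky(42);
        return new int[] { seenDuringConstruction, leaked.x };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println("seen inside doSomething(): " + r[0]);
        System.out.println("seen after constructor:    " + r[1]);
    }
}
```

Inside doSomething() the field reads as 0, and only after the constructor completes does it read as 42. If the runtime skipped the zeroing store for an object that escapes like this, the first read could return garbage, which Java does not allow.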

It is better to run with -XX:+UseParallelGC -XX:-TieredCompilation -XX:-UseBiasedLocking to make generated code more understandable — this is an educational exercise anyway. JMH’s -prof perfasm is perfect to dump the generated code for these tests.

This is the Init case:

                                                  ; ------- allocation ----------
0x00007efdc466d4cc: mov    0x60(%r15),%rax          ; TLAB allocation below
0x00007efdc466d4d0: mov    %rax,%r10
0x00007efdc466d4d3: add    $0x10,%r10
0x00007efdc466d4d7: cmp    0x70(%r15),%r10
0x00007efdc466d4db: jae    0x00007efdc466d50a
0x00007efdc466d4dd: mov    %r10,0x60(%r15)
0x00007efdc466d4e1: prefetchnta 0xc0(%r10)
                                                  ; ------- /allocation ---------
                                                  ; ------- system init ---------
0x00007efdc466d4e9: movq   $0x1,(%rax)              ; put mark word header
0x00007efdc466d4f0: movl   $0xf8021bc4,0x8(%rax)    ; put class word header
                                                  ; ...... system/user init .....
0x00007efdc466d4f7: movl   $0x2a,0xc(%rax)          ; x = 42.
                                                  ; -------- /user init ---------

You can see TLAB allocation, initialization of object metadata, and then coalesced system+user initialization of the field. This changes quite a bit for the InitLeaky case:

                                                  ; ------- allocation ----------
0x00007fc69571bf4c: mov    0x60(%r15),%rax
0x00007fc69571bf50: mov    %rax,%r10
0x00007fc69571bf53: add    $0x10,%r10
0x00007fc69571bf57: cmp    0x70(%r15),%r10
0x00007fc69571bf5b: jae    0x00007fc69571bf9e
0x00007fc69571bf5d: mov    %r10,0x60(%r15)
0x00007fc69571bf61: prefetchnta 0xc0(%r10)
                                                  ; ------- /allocation ---------
                                                  ; ------- system init ---------
0x00007fc69571bf69: movq   $0x1,(%rax)              ; put mark word header
0x00007fc69571bf70: movl   $0xf8021bc4,0x8(%rax)    ; put class word header
0x00007fc69571bf77: mov    %r12d,0xc(%rax)          ; x = 0 (%r12 happens to hold 0)
                                                  ; ------- /system init --------
                                                  ; -------- user init ----------
0x00007fc69571bf7b: mov    %rax,%rbp
0x00007fc69571bf7e: mov    %rbp,%rsi
0x00007fc69571bf81: xchg   %ax,%ax
0x00007fc69571bf83: callq  0x00007fc68e269be0       ; call doSomething()
0x00007fc69571bf88: movl   $0x2a,0xc(%rbp)          ; x = 42
                                                  ; ------ /user init ------

Here, since the optimizers cannot figure out whether the value of x is needed inside doSomething(), they have to assume the worst: perform system initialization first, and only then finish up the user initialization.

Observations

While the textbook definition is sound, and the bytecode reflects the same definition, the optimizers may do magic under the covers to improve performance, as long as it does not yield surprising behaviors. From the compiler perspective, this is a trivial optimization, but from the conceptual point of view it operates across the theoretical "staging" boundaries.