About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post describes some elementary piece of knowledge about the JVM. The name underlines the fact that no single post can be taken in isolation: most pieces described here readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic errors, or duplicates. Use and/or trust this at your own risk.

Aleksey Shipilёv, JVM/Performance Geek, Red Hat
Shout out at Twitter: @shipilev
Questions, comments, suggestions: aleksey@shipilev.net

Question

What is so expensive about creating new objects? What defines the object instantiation performance?

Theory

If you look closely at object instantiation paths on larger objects, you will inevitably wonder how the different parts scale, and what is the actual bottleneck in real world cases. We have seen that TLAB allocation seems very efficient, and that system initialization can be coalesced with user initialization. But in the end, we still have to do writes to memory — can we tell how much it costs?

Experiment

One primitive that can tell us the initialization part of story is plain old Java array. It needs to be initialized too, and its length is variable, which lets us see what is going on at different sizes. With that in mind, let us construct this benchmark:

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class UA {
    @Param({"1", "10", "100", "1000", "10000", "100000"})
    int size;

    @Benchmark
    public byte[] java() {
        return new byte[size];
    }
}

It makes sense to run with the latest JDK 9 EA (the follow-up test below requires it), with -XX:+UseParallelOldGC to minimize GC overhead, and -Xmx20g -Xms20g -Xmn18g to keep most of the heap for these new allocations. Without further delay, this is what you will get on a good development desktop (like my i7-4790K, 4.0 GHz, Linux x86_64), when all 8 hardware threads are busy:

Benchmark                  (size)  Mode  Cnt       Score       Error   Units

# Time to allocate
UA.java                         1  avgt   15      20.307 ±     4.532   ns/op
UA.java                        10  avgt   15      26.657 ±     6.072   ns/op
UA.java                       100  avgt   15     106.632 ±    34.742   ns/op
UA.java                      1000  avgt   15     681.176 ±   124.980   ns/op
UA.java                     10000  avgt   15    4576.433 ±   909.956   ns/op
UA.java                    100000  avgt   15   44881.095 ± 13765.440   ns/op

# Allocation rate
UA.java:·gc.alloc.rate          1  avgt   15    6228.153 ±  1059.385  MB/sec
UA.java:·gc.alloc.rate         10  avgt   15    6335.809 ±   986.395  MB/sec
UA.java:·gc.alloc.rate        100  avgt   15    6126.333 ±  1354.964  MB/sec
UA.java:·gc.alloc.rate       1000  avgt   15    7772.263 ±  1263.453  MB/sec
UA.java:·gc.alloc.rate      10000  avgt   15   11518.422 ±  2155.516  MB/sec
UA.java:·gc.alloc.rate     100000  avgt   15   12039.594 ±  2724.242  MB/sec

One can see that the allocation takes around 20 ns (the single-threaded number is much lower, but this average is skewed by hyper-threading), which goes gradually up to about 45 us for a 100K array. If you look at the allocation rate, you will see it saturates at around 12 GB/sec. Experiments like these form the basis for other performance tests, by the way: it is important to understand what order of memory bandwidth / allocation rate you can pull off on a particular machine.
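As a sanity check, the allocation rate can be estimated on the back of an envelope from the per-op score. This is a rough sketch with hypothetical helper names; it ignores the hyper-threading skew mentioned above, so expect only order-of-magnitude agreement with gc.alloc.rate:

```java
public class AllocRateEstimate {
    // Crude estimate: bytes allocated per op, divided by time per op,
    // scaled by the number of benchmark threads. bytes/ns equals GB/sec.
    static double estimateMBps(long bytesPerOp, double nsPerOp, int threads) {
        return bytesPerOp / nsPerOp * threads * 1e9 / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // size=100000: the byte[] payload plus ~16 bytes of header,
        // at ~44881 ns/op on 8 threads; this lands within 2x of the
        // ~12 GB/sec the profiler reports, which is about all the
        // precision such an estimate can promise.
        System.out.printf("~%.0f MB/sec%n", estimateMBps(100_016, 44881.095, 8));
    }
}
```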

Can we see what code dominates the execution time? Of course we can, again, with JMH’s -prof perfasm. For -p size=100000, it will print out the hottest code like this:

              0x00007f1f094f650b: movq   $0x1,(%rdx)              ; store mark word
  0.00%       0x00007f1f094f6512: prefetchnta 0xc0(%r9)
  0.64%       0x00007f1f094f651a: movl   $0xf80000f5,0x8(%rdx)    ; store klass word
  0.02%       0x00007f1f094f6521: mov    %r11d,0xc(%rdx)          ; store array length
              0x00007f1f094f6525: prefetchnta 0x100(%r9)
  0.05%       0x00007f1f094f652d: prefetchnta 0x140(%r9)
  0.07%       0x00007f1f094f6535: prefetchnta 0x180(%r9)
  0.09%       0x00007f1f094f653d: shr    $0x3,%rcx
  0.00%       0x00007f1f094f6541: add    $0xfffffffffffffffe,%rcx
              0x00007f1f094f6545: xor    %rax,%rax
              0x00007f1f094f6548: cmp    $0x8,%rcx
        ╭     0x00007f1f094f654c: jg     0x00007f1f094f655e       ; large enough? jump
        │     0x00007f1f094f654e: dec    %rcx
        │╭    0x00007f1f094f6551: js     0x00007f1f094f6565       ; zero length? jump
        ││↗   0x00007f1f094f6553: mov    %rax,(%rdi,%rcx,8)       ; small loop init
        │││   0x00007f1f094f6557: dec    %rcx
        ││╰   0x00007f1f094f655a: jge    0x00007f1f094f6553
        ││ ╭  0x00007f1f094f655c: jmp    0x00007f1f094f6565
        ↘│ │  0x00007f1f094f655e: shl    $0x3,%rcx
 89.12%  │ │  0x00007f1f094f6562: rep rex.W stos %al,%es:(%rdi)   ; large loop init
  0.20%  ↘ ↘  0x00007f1f094f6565: mov    %r8,(%rsp)

You may find this code shape familiar from the "TLAB Allocation" and "New Object Stages" issues in the series. What is interesting here is that we have to initialize a much larger chunk of array data. For this reason, we see the inlined rep stos sequence on x86, which stores a given byte a given number of times, and seems most effective on recent x86s. One can spy with a little eye that there is also an initializing loop for small arrays (8 elements or fewer): rep stos has upfront startup costs, so a plain loop is beneficial there.
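The branchy shape above can be mimicked in plain Java. This is only a sketch: the actual JIT cutoff and generated code differ, and Arrays.fill itself gets intrinsified:

```java
import java.util.Arrays;

public class InitStrategy {
    // Mirrors the generated code's choice: an explicit short loop for tiny
    // arrays, because rep stos has upfront startup costs, and a bulk fill
    // (which may compile down to rep stos or wide stores) for larger ones.
    static void zero(long[] a) {
        if (a.length <= 8) {
            for (int i = a.length - 1; i >= 0; i--) {
                a[i] = 0;            // "small loop init"
            }
        } else {
            Arrays.fill(a, 0L);      // "large loop init"
        }
    }

    public static void main(String[] args) {
        long[] small = {1, 2, 3};
        long[] large = new long[100];
        Arrays.fill(large, 42L);
        zero(small);
        zero(large);
        System.out.println(small[0] + " " + large[99]); // prints "0 0"
    }
}
```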

This example shows that for large objects/arrays, initialization costs dominate the performance. For small objects/arrays, the writes of metadata (headers, array lengths) dominate instead. The small array case is not significantly different from the small object case.

Can we guesstimate what the performance would be if we somehow avoided initialization? Compilers routinely coalesce system and user initialization, but can we get no initialization at all? Granted, an uninitialized object serves no practical purpose, because we will still fill it with user data later, but synthetic tests are fun, am I right?
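As a reminder of what coalescing buys us, consider this sketch (hypothetical names; whether the zeroing is actually elided depends on the JIT proving that every element gets overwritten):

```java
public class Coalesce {
    // System initialization zeroes the freshly allocated array; the user loop
    // then overwrites every element. An optimizing compiler may coalesce the
    // two, so the backing memory is effectively written only once.
    static byte[] filled(int size, byte v) {
        byte[] b = new byte[size];       // system init: zeroing
        for (int i = 0; i < b.length; i++) {
            b[i] = v;                    // user init: may subsume the zeroing
        }
        return b;
    }

    public static void main(String[] args) {
        byte[] b = filled(4, (byte) 7);
        System.out.println(b[0] + " " + b.length); // prints "7 4"
    }
}
```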

Turns out, there is an Unsafe method that allocates uninitialized arrays, so we may use it to see the benefits. Unsafe is not Java: it does not play by Java rules, and it sometimes even defies the JVM rules. It is not an API for public use; it is there for internal JDK use and JDK-VM interoperability. You can't have it, it is not guaranteed to work for you, and it can go away, crash, and burn without telling you.

Still, we can use it in synthetics, like this:

import jdk.internal.misc.Unsafe;
import org.openjdk.jmh.annotations.*;

import java.lang.reflect.Field;
import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class UA {
    static Unsafe U;

    static {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            U = (Unsafe) field.get(null);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    @Param({"1", "10", "100", "1000", "10000", "100000"})
    int size;

    @Benchmark
    public byte[] unsafe() {
        return (byte[]) U.allocateUninitializedArray(byte.class, size);
    }
}

Why don’t we run it? Okay:

Benchmark                  (size)  Mode  Cnt        Score       Error   Units
UA.unsafe                       1  avgt   15       19.766 ±     4.002   ns/op
UA.unsafe                      10  avgt   15       27.486 ±     7.005   ns/op
UA.unsafe                     100  avgt   15       80.040 ±    15.754   ns/op
UA.unsafe                    1000  avgt   15      156.041 ±     0.552   ns/op
UA.unsafe                   10000  avgt   15      162.384 ±     1.448   ns/op
UA.unsafe                  100000  avgt   15      309.769 ±     2.819   ns/op

UA.unsafe:·gc.alloc.rate        1  avgt   15     6359.987 ±   928.472  MB/sec
UA.unsafe:·gc.alloc.rate       10  avgt   15     6193.103 ±  1160.353  MB/sec
UA.unsafe:·gc.alloc.rate      100  avgt   15     7855.147 ±  1313.314  MB/sec
UA.unsafe:·gc.alloc.rate     1000  avgt   15    33171.384 ±   153.645  MB/sec
UA.unsafe:·gc.alloc.rate    10000  avgt   15   315740.299 ±  3678.459  MB/sec
UA.unsafe:·gc.alloc.rate   100000  avgt   15  1650860.763 ± 14498.920  MB/sec

Holy crap! 100K arrays are allocated with 1.6 terabytes per second rate. Can we see where we spend time now?

          0x00007f65fd722c74: prefetchnta 0xc0(%r11)
 66.06%   0x00007f65fd722c7c: movq   $0x1,(%rax)           ; store mark word
  0.40%   0x00007f65fd722c83: prefetchnta 0x100(%r11)
  4.43%   0x00007f65fd722c8b: movl   $0xf80000f5,0x8(%rax) ; store class word
  0.01%   0x00007f65fd722c92: mov    %edx,0xc(%rax)        ; store array length
          0x00007f65fd722c95: prefetchnta 0x140(%r11)
  5.18%   0x00007f65fd722c9d: prefetchnta 0x180(%r11)
  4.99%   0x00007f65fd722ca5: mov    %r8,0x40(%rsp)
          0x00007f65fd722caa: mov    %rax,%rdx

Oh yeah, only touching the memory to push the metadata out. The cycle accounting is skewed towards the prefetches, because they now pay most of the cost of pre-accessing memory for the upcoming writes.

One might wonder what toll this exerts on GC, and the answer is: not that much! The objects are almost all dead, so any mark-(copy|sweep|compact) GC would breeze through a workload like this. The picture gets interesting when objects start surviving at TB/sec rates. Some GC folks I know call these "impossible workloads", because they are both impossible in reality and impossible to deal with. Try to drink from a firehose to learn this in practice.

Anyhow, we can see that with pure allocations, GCs survive fine. With the same workload, we can see the apparent application pauses by using JMH's -prof pauses profiler. It runs a high-priority thread and records the perceived pauses:

Benchmark                  (size)  Mode  Cnt    Score   Error  Units
UA.unsafe                  100000  avgt    5  315.732 ± 5.133  ns/op
UA.unsafe:·pauses          100000  avgt   84  537.018             ms
UA.unsafe:·pauses.avg      100000  avgt         6.393             ms
UA.unsafe:·pauses.count    100000  avgt        84.000              #
UA.unsafe:·pauses.p0.00    100000  avgt         2.560             ms
UA.unsafe:·pauses.p0.50    100000  avgt         6.148             ms
UA.unsafe:·pauses.p0.90    100000  avgt         9.642             ms
UA.unsafe:·pauses.p0.95    100000  avgt         9.802             ms
UA.unsafe:·pauses.p0.99    100000  avgt        14.418             ms
UA.unsafe:·pauses.p0.999   100000  avgt        14.418             ms
UA.unsafe:·pauses.p0.9999  100000  avgt        14.418             ms
UA.unsafe:·pauses.p1.00    100000  avgt        14.418             ms

So, it detected around 84 pauses, the longest at 14 ms, while the average is about 6 ms. Profilers like these are inherently imprecise, because they contend for CPUs with the workload threads, they are at the mercy of the OS scheduler, etc.

In many cases, it is better to have the JVM itself tell us when it stops the application threads. This can be done with JMH's -prof safepoints profiler, which records the "safepoint" / "stop the world" events, when all application threads are stopped and the VM does its work. GC pauses are naturally a subset of safepoint events.

Benchmark                            (size)  Mode  Cnt     Score    Error  Units
UA.unsafe                            100000  avgt    5   328.247 ± 34.450  ns/op
UA.unsafe:·safepoints.interval       100000  avgt       5043.000              ms
UA.unsafe:·safepoints.pause          100000  avgt  639   617.796              ms
UA.unsafe:·safepoints.pause.avg      100000  avgt          0.967              ms
UA.unsafe:·safepoints.pause.count    100000  avgt        639.000               #
UA.unsafe:·safepoints.pause.p0.00    100000  avgt          0.433              ms
UA.unsafe:·safepoints.pause.p0.50    100000  avgt          0.627              ms
UA.unsafe:·safepoints.pause.p0.90    100000  avgt          2.150              ms
UA.unsafe:·safepoints.pause.p0.95    100000  avgt          2.241              ms
UA.unsafe:·safepoints.pause.p0.99    100000  avgt          2.979              ms
UA.unsafe:·safepoints.pause.p0.999   100000  avgt         12.599              ms
UA.unsafe:·safepoints.pause.p0.9999  100000  avgt         12.599              ms
UA.unsafe:·safepoints.pause.p1.00    100000  avgt         12.599              ms

See, this profiler records 639 safepoints, with an average time of less than 1 ms and a maximum of 12 ms. Not bad, taking into account the 1.6 TB/sec allocation rate!

Observations

Initialization costs are a significant part of object/array instantiation. With TLAB allocation, object/array creation speed is largely dominated either by the metadata writes (for smaller things) or by the content initialization (for larger things). The allocation rate alone is not always a good predictor of performance: you can manage huge allocation rates if your runtime pulls weird tricks for you.