About, Disclaimers, Contacts
"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic errors, or duplicates. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net
Question
What is so expensive about creating new objects? What defines the object instantiation performance?
Theory
If you look closely at object instantiation paths for larger objects, you will inevitably wonder how the different parts scale, and what the actual bottleneck is in real-world cases. We have seen that TLAB allocation seems very efficient, and that system initialization can be coalesced with user initialization. But in the end, we still have to write to memory: can we tell how much that costs?
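As a quick refresher, here is a minimal sketch of what "system initialization coalesced with user initialization" means (a hypothetical class, not from the experiment below): the JVM must conceptually zero the object's fields first, but an optimizing JIT can prove that the constructor stores overwrite those zeroes, and emit a single set of writes.

public class Point {
    int x, y;

    Point(int x, int y) {
        this.x = x; // user store; the JIT can prove it overwrites the zeroed field
        this.y = y; // and drop the separate zeroing ("system") store
    }
}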
Experiment
One primitive that can tell us the initialization part of the story is the plain old Java array. It needs to be initialized too, and its length is variable, which lets us see what is going on at different sizes. With that in mind, let us construct this benchmark:
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class UA {
    @Param({"1", "10", "100", "1000", "10000", "100000"})
    int size;

    @Benchmark
    public byte[] java() {
        return new byte[size];
    }
}
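If you prefer launching JMH from code instead of the command line, a minimal runner could look like this (a sketch; it assumes the standard JMH Runner API and passes the JVM flags discussed below):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunUA {
    public static void main(String[] args) throws Exception {
        Options opts = new OptionsBuilder()
                .include(UA.class.getSimpleName())        // run the UA benchmarks
                .jvmArgs("-XX:+UseParallelOldGC",          // minimize GC overhead
                         "-Xmx20g", "-Xms20g", "-Xmn18g")  // keep heap for allocations
                .build();
        new Runner(opts).run();
    }
}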
It makes sense to run with the latest JDK 9 EA (the follow-up test below requires it), with -XX:+UseParallelOldGC to minimize GC overhead, and -Xmx20g -Xms20g -Xmn18g to keep most of the heap for these new allocations. Without further delay, this is what you will get on a good development desktop (like my i7-4790K, 4.0 GHz, Linux x86_64), when all 8 hardware threads are busy:
Benchmark (size) Mode Cnt Score Error Units
# Time to allocate
UA.java 1 avgt 15 20.307 ± 4.532 ns/op
UA.java 10 avgt 15 26.657 ± 6.072 ns/op
UA.java 100 avgt 15 106.632 ± 34.742 ns/op
UA.java 1000 avgt 15 681.176 ± 124.980 ns/op
UA.java 10000 avgt 15 4576.433 ± 909.956 ns/op
UA.java 100000 avgt 15 44881.095 ± 13765.440 ns/op
# Allocation rate
UA.java:·gc.alloc.rate 1 avgt 15 6228.153 ± 1059.385 MB/sec
UA.java:·gc.alloc.rate 10 avgt 15 6335.809 ± 986.395 MB/sec
UA.java:·gc.alloc.rate 100 avgt 15 6126.333 ± 1354.964 MB/sec
UA.java:·gc.alloc.rate 1000 avgt 15 7772.263 ± 1263.453 MB/sec
UA.java:·gc.alloc.rate 10000 avgt 15 11518.422 ± 2155.516 MB/sec
UA.java:·gc.alloc.rate 100000 avgt 15 12039.594 ± 2724.242 MB/sec
One can see that an allocation takes around 20 ns (the single-threaded case is much lower, but this average is skewed by hyper-threading), which gradually goes up to about 40 us for a 100K array. If you look at the allocation rate, you will see it saturates at around 12 GB/sec. Experiments like these form the basis for other performance tests, by the way: it is important to understand what order of memory bandwidth / allocation rate you can pull off on a particular machine.
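As a back-of-the-envelope cross-check (my arithmetic, under the assumption that all 8 hardware threads allocate independently):

100,000 bytes/op / 44,881 ns/op ≈ 2.2 GB/sec per thread; x 8 threads ≈ 18 GB/sec

which is the same order as the ~12 GB/sec the allocation rate counter reports; the gap is plausibly benchmark overhead and hyper-thread contention.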
Can we see what code dominates the execution time? Of course we can, again, with JMH's -prof perfasm. For -p size=100000, it prints out the hottest code like this:
0x00007f1f094f650b: movq $0x1,(%rdx) ; store mark word
0.00% 0x00007f1f094f6512: prefetchnta 0xc0(%r9)
0.64% 0x00007f1f094f651a: movl $0xf80000f5,0x8(%rdx) ; store klass word
0.02% 0x00007f1f094f6521: mov %r11d,0xc(%rdx) ; store array length
0x00007f1f094f6525: prefetchnta 0x100(%r9)
0.05% 0x00007f1f094f652d: prefetchnta 0x140(%r9)
0.07% 0x00007f1f094f6535: prefetchnta 0x180(%r9)
0.09% 0x00007f1f094f653d: shr $0x3,%rcx
0.00% 0x00007f1f094f6541: add $0xfffffffffffffffe,%rcx
0x00007f1f094f6545: xor %rax,%rax
0x00007f1f094f6548: cmp $0x8,%rcx
╭ 0x00007f1f094f654c: jg 0x00007f1f094f655e ; large enough? jump
│ 0x00007f1f094f654e: dec %rcx
│╭ 0x00007f1f094f6551: js 0x00007f1f094f6565 ; zero length? jump
││↗ 0x00007f1f094f6553: mov %rax,(%rdi,%rcx,8) ; small loop init
│││ 0x00007f1f094f6557: dec %rcx
││╰ 0x00007f1f094f655a: jge 0x00007f1f094f6553
││ ╭ 0x00007f1f094f655c: jmp 0x00007f1f094f6565
↘│ │ 0x00007f1f094f655e: shl $0x3,%rcx
89.12% │ │ 0x00007f1f094f6562: rep rex.W stos %al,%es:(%rdi) ; large loop init
0.20% ↘ ↘ 0x00007f1f094f6565: mov %r8,(%rsp)
You may find this code shape familiar from the "TLAB Allocation" and "New Object Stages" issues in this series. What is interesting here is that we have to initialize a much larger chunk of array data. This is why we see the inlined rep stos sequence on x86, which repeatedly stores the given byte a given number of times, and seems to be the most effective option on recent x86s. One can spy with a little eye that there is also an initializing loop for small arrays (8 elements or fewer): rep stos has upfront startup costs, so a simple loop is beneficial there.
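A rough Java rendition of the dispatch the generated code performs (an illustrative sketch only; the real stub operates on raw words and addresses):

public class InitSketch {
    static void zeroWords(long[] words) {
        int n = words.length;
        if (n > 8) {
            // large array: hand off to a bulk fill (the rep stos-like path above)
            java.util.Arrays.fill(words, 0L);
        } else {
            // small array: a short counted loop, avoiding bulk-store startup costs
            for (int i = n - 1; i >= 0; i--) {
                words[i] = 0L;
            }
        }
    }

    public static void main(String[] args) {
        zeroWords(new long[4]);    // takes the small-loop path
        zeroWords(new long[100]);  // takes the bulk path
    }
}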
This example shows that for large objects/arrays, initialization costs dominate the performance. If objects/arrays are small, then the writes of metadata (headers, array lengths) dominate. The small array case would not be significantly different from the small object case.
Can we guesstimate what the performance would be if we somehow avoided initialization? The compilers routinely coalesce system and user initialization, but can we get no initialization at all? Granted, an uninitialized object would serve no practical purpose, because we would still fill it with user data later. But synthetic tests are fun, am I right?
It turns out there is an Unsafe method that allocates uninitialized arrays, so we may use it to see the benefits. Unsafe is not Java: it does not play by Java rules, and it sometimes even defies the JVM rules. It is not an API for public use; it is there for internal JDK use and JDK-VM interoperability. You can't have it, it is not guaranteed to work for you, and it can go away, crash, and burn without telling you.
Still, we can use it in synthetics, like this:
import java.lang.reflect.Field;
import java.util.concurrent.TimeUnit;

import jdk.internal.misc.Unsafe;
import org.openjdk.jmh.annotations.*;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class UA {
    static Unsafe U;

    static {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            U = (Unsafe) field.get(null);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    @Param({"1", "10", "100", "1000", "10000", "100000"})
    int size;

    @Benchmark
    public byte[] unsafe() {
        return (byte[]) U.allocateUninitializedArray(byte.class, size);
    }
}
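Note that jdk.internal.misc is not exported from java.base in JDK 9, so compiling against jdk.internal.misc.Unsafe and reflectively poking theUnsafe would presumably need flags along the lines of --add-exports java.base/jdk.internal.misc=ALL-UNNAMED (and --add-opens for the deep reflective access). Treat this as my assumption about the setup, not something stated by the original experiment.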
Why don’t we run it? Okay:
Benchmark (size) Mode Cnt Score Error Units
UA.unsafe 1 avgt 15 19.766 ± 4.002 ns/op
UA.unsafe 10 avgt 15 27.486 ± 7.005 ns/op
UA.unsafe 100 avgt 15 80.040 ± 15.754 ns/op
UA.unsafe 1000 avgt 15 156.041 ± 0.552 ns/op
UA.unsafe 10000 avgt 15 162.384 ± 1.448 ns/op
UA.unsafe 100000 avgt 15 309.769 ± 2.819 ns/op
UA.unsafe:·gc.alloc.rate 1 avgt 15 6359.987 ± 928.472 MB/sec
UA.unsafe:·gc.alloc.rate 10 avgt 15 6193.103 ± 1160.353 MB/sec
UA.unsafe:·gc.alloc.rate 100 avgt 15 7855.147 ± 1313.314 MB/sec
UA.unsafe:·gc.alloc.rate 1000 avgt 15 33171.384 ± 153.645 MB/sec
UA.unsafe:·gc.alloc.rate 10000 avgt 15 315740.299 ± 3678.459 MB/sec
UA.unsafe:·gc.alloc.rate 100000 avgt 15 1650860.763 ± 14498.920 MB/sec
Holy crap! 100K arrays are allocated at a 1.6 terabytes per second rate. Can we see where we spend time now?
0x00007f65fd722c74: prefetchnta 0xc0(%r11)
66.06% 0x00007f65fd722c7c: movq $0x1,(%rax) ; store mark word
0.40% 0x00007f65fd722c83: prefetchnta 0x100(%r11)
4.43% 0x00007f65fd722c8b: movl $0xf80000f5,0x8(%rax) ; store class word
0.01% 0x00007f65fd722c92: mov %edx,0xc(%rax) ; store array length
0x00007f65fd722c95: prefetchnta 0x140(%r11)
5.18% 0x00007f65fd722c9d: prefetchnta 0x180(%r11)
4.99% 0x00007f65fd722ca5: mov %r8,0x40(%rsp)
0x00007f65fd722caa: mov %rax,%rdx
Oh yeah, touching the memory to push the metadata out. The cycle accounting is skewed towards the prefetches, because they now pay most of the cost of pre-accessing memory for the upcoming writes.
One might wonder what toll that exerts on the GC, and the answer is: not that much! The objects are almost all dead, so any mark-(copy|sweep|compact) GC would breeze through a workload like this. When objects start surviving at a TB/sec rate, the picture gets interesting. Some GC folks I know call these things "impossible workloads", because they are both impossible in reality and impossible to deal with. Try to drink from a firehose to learn this in practice.
Anyhow, we can see that with pure allocations, GCs survive fine. With the same workload, we can see what the apparent application pauses are by using JMH's -prof pauses profiler. It runs a high-priority thread and records the perceived pauses:
Benchmark (size) Mode Cnt Score Error Units
UA.unsafe 100000 avgt 5 315.732 ± 5.133 ns/op
UA.unsafe:·pauses 100000 avgt 84 537.018 ms
UA.unsafe:·pauses.avg 100000 avgt 6.393 ms
UA.unsafe:·pauses.count 100000 avgt 84.000 #
UA.unsafe:·pauses.p0.00 100000 avgt 2.560 ms
UA.unsafe:·pauses.p0.50 100000 avgt 6.148 ms
UA.unsafe:·pauses.p0.90 100000 avgt 9.642 ms
UA.unsafe:·pauses.p0.95 100000 avgt 9.802 ms
UA.unsafe:·pauses.p0.99 100000 avgt 14.418 ms
UA.unsafe:·pauses.p0.999 100000 avgt 14.418 ms
UA.unsafe:·pauses.p0.9999 100000 avgt 14.418 ms
UA.unsafe:·pauses.p1.00 100000 avgt 14.418 ms
So, it detected around 84 pauses, the longest being 14 ms, while the average is about 6 ms. Profilers like these are inherently imprecise, because they contend for CPUs with the workload threads, they are at the mercy of the OS scheduler, etc.
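For illustration, here is a minimal sketch of the idea behind such a profiler (my simplification, not JMH's actual implementation): a spinning thread samples the clock and treats any unusually large gap between samples as a perceived pause.

public class PauseDetector extends Thread {
    static final long THRESHOLD_NS = 1_000_000; // report gaps longer than 1 ms

    @Override
    public void run() {
        long last = System.nanoTime();
        while (!isInterrupted()) {
            long now = System.nanoTime();
            if (now - last > THRESHOLD_NS) {
                System.out.printf("Perceived pause: %.3f ms%n", (now - last) / 1e6);
            }
            last = now;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PauseDetector pd = new PauseDetector();
        pd.setPriority(Thread.MAX_PRIORITY);
        pd.setDaemon(true);
        pd.start();
        Thread.sleep(10_000); // let it watch while the workload runs
    }
}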
In many cases, it is better to have the JVM itself tell us when it stops the application threads. This can be done with JMH's -prof safepoints profiler, which records the "safepoint" ("stop-the-world") events, when all application threads are stopped and the VM does its work. GC pauses are naturally a subset of safepoint events.
Benchmark (size) Mode Cnt Score Error Units
UA.unsafe 100000 avgt 5 328.247 ± 34.450 ns/op
UA.unsafe:·safepoints.interval 100000 avgt 5043.000 ms
UA.unsafe:·safepoints.pause 100000 avgt 639 617.796 ms
UA.unsafe:·safepoints.pause.avg 100000 avgt 0.967 ms
UA.unsafe:·safepoints.pause.count 100000 avgt 639.000 #
UA.unsafe:·safepoints.pause.p0.00 100000 avgt 0.433 ms
UA.unsafe:·safepoints.pause.p0.50 100000 avgt 0.627 ms
UA.unsafe:·safepoints.pause.p0.90 100000 avgt 2.150 ms
UA.unsafe:·safepoints.pause.p0.95 100000 avgt 2.241 ms
UA.unsafe:·safepoints.pause.p0.99 100000 avgt 2.979 ms
UA.unsafe:·safepoints.pause.p0.999 100000 avgt 12.599 ms
UA.unsafe:·safepoints.pause.p0.9999 100000 avgt 12.599 ms
UA.unsafe:·safepoints.pause.p1.00 100000 avgt 12.599 ms
See, this profiler records 639 safepoints, with an average time of less than 1 ms, and the largest at 12 ms. Not bad, taking into account the 1.6 TB/sec allocation rate!
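Outside of JMH, you can get a similar view from the JVM itself: on JDK 9, unified logging with -Xlog:safepoint prints per-safepoint timings. The exact output format is version-dependent, so take this as a pointer rather than a recipe.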
Observations
The initialization costs are a significant part of object/array instantiation. With TLAB allocation, the object/array creation speed is largely dominated either by the metadata writes (for smaller things), or by the content initialization (for larger things). The allocation rate alone is not always a good predictor of performance, as you can manage huge allocation rates if your runtime pulls weird tricks for you.