JVM Anatomy Quark #4: TLAB allocation

About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep for only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing 'tyle, syntaxtic and semantically errors, duplicates, or also consistency. Use and/or trust this at your own risk.

Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net

Question

What is TLAB allocation? Pointer-bump allocation? Who is responsible for allocating objects anyway?

Theory

Most of the time we do new MyClass(), the runtime environment has to allocate storage for the instance in question. The textbook GC (memory manager) interface for allocation is very simple:

 ref Allocate(T type);
 ref AllocateArray(T type, int size);

Of course, since memory managers are usually written in the language different from the language runtime is targeted by (e.g. Java targets JVM, yet HotSpot JVM is written in C++), the interface gets murkier. For example, such a call from Java program needs to transit into native VM code. Does it cost much? Probably. Does the memory manager have to cope with multiple threads begging for memory? For sure.

So to optimize this, we may instead allow threads to allocate the entire blocks of memory for their needs, and only transit to VM to get a new block. In Hotspot, these blocks are called Thread Local Allocation Buffers (TLABs), and there is a sophisticated machinery built to support them. Notice that TLABs are thread-local in the temporal sense, meaning they act like the buffers to accept current allocations. They still are parts of Java heap, the thread can still write the reference to a newly allocated object into the field outside of TLAB, etc.

All known OpenJDK GCs support TLAB allocation. This part of VM code is pretty well shared among them. All Hotspot compilers support TLAB allocation, so you would usually see the generated code for object allocation like this:

0x00007f3e6bb617cc: mov    0x60(%r15),%rax        ; TLAB "current"
0x00007f3e6bb617d0: mov    %rax,%r10              ; tmp = current
0x00007f3e6bb617d3: add    $0x10,%r10             ; tmp += 16 (object size)
0x00007f3e6bb617d7: cmp    0x70(%r15),%r10        ; tmp > tlab_size?
0x00007f3e6bb617db: jae    0x00007f3e6bb61807     ; TLAB is done, jump and request another one
0x00007f3e6bb617dd: mov    %r10,0x60(%r15)        ; current = tmp (TLAB is fine, alloc!)
0x00007f3e6bb617e1: prefetchnta 0xc0(%r10)        ; ...
0x00007f3e6bb617e9: movq   $0x1,(%rax)            ; store header to (obj+0)
0x00007f3e6bb617f0: movl   $0xf80001dd,0x8(%rax)  ; store klass to (obj+8)
0x00007f3e6bb617f7: mov    %r12d,0xc(%rax)        ; zero out the rest of the object

The allocation path is inlined in the generated code, and as such does not require calling into GC to allocate the object. If we are requesting to allocate the object that depletes the TLAB, or the objects is large enough to never fit into the TLAB, then we take a "slow path", and either satisfy the allocation there, or come back with a fresh TLAB. Notice how the most frequent "normal" path is just adding the object size to TLAB current cursor, and then moving on.

This is why this allocation mechanism is sometimes called "pointer bump allocation". Pointer bump requires a contiguous chunk of memory to allocate to, though — which brings back the need for heap compaction. Notice how CMS does free-list allocation in "old" generation, thus enabling concurrent sweep, but it has compacting stop-the-world "young" collections, that benefit from pointer bump allocation! A much lower quantity of objects that survived the young collection would pay the cost of free list allocation.

For the sake of experiment, we can turn TLAB machinery off with -XX:-UseTLAB. Then, all allocations would take into the native method, like this:

-   17.12%     0.00%  org.openjdk.All  perf-31615.map
   - 0x7faaa3b2d125
      - 16.59% OptoRuntime::new_instance_C
         - 11.49% InstanceKlass::allocate_instance
              2.33% BlahBlahBlahCollectedHeap::mem_allocate  <---- entry point to GC
              0.35% AllocTracer::send_allocation_outside_tlab_event

…but, as you would see below, it is usually a bad idea.

Experiment

As usual, let us try to construct an experiment to see TLAB allocation in action. Since the machinery is shared by all GC implementations there, in makes sense to minimize the impact of other parts of runtime by using experimental Epsilon GC. Indeed, it implements allocation only, and thus provides a good research vessel for this work.

Quickly drafting a workload: allocating 50M objects (why not?), and running with JMH with SingleShot mode to get statistics and profiling for free. You could do this with a standalone test too, but SingleShot is just too convenient here.

@Warmup(iterations = 3)
@Measurement(iterations = 3)
@Fork(3)
@BenchmarkMode(Mode.SingleShotTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class AllocArray {
    @Benchmark
    public Object test() {
        final int size = 50_000_000;
        Object[] objects = new Object[size];
        for (int c = 0; c < size; c++) {
            objects[c] = new Object();
        }
        return objects;
    }
}

This test allocates 50M objects in a single thread. This is empirically selected to make 20 GB heap last for at least 6 iterations, as we would see next. The experimental -XX:EpsilonTLABSize option is used to control the TLAB sizing exactly. Other OpenJDK GCs share the adaptive TLAB sizing policy that selects the sizes based on allocation pressure and other concerns. For our performance tests it is easier to nail the TLAB size.

Without further ado, these are the results:

Benchmark                     Mode  Cnt     Score    Error   Units

# Times, lower is better                                            # TLAB size
AllocArray.test                 ss    9   548.462 ±  6.989   ms/op  #      1 KB
AllocArray.test                 ss    9   268.037 ± 10.966   ms/op  #      4 KB
AllocArray.test                 ss    9   230.726 ±  4.119   ms/op  #     16 KB
AllocArray.test                 ss    9   223.075 ±  2.267   ms/op  #    256 KB
AllocArray.test                 ss    9   225.404 ± 17.080   ms/op  #   1024 KB

# Allocation rates, higher is better
AllocArray.test:·gc.alloc.rate  ss    9  1816.094 ± 13.681  MB/sec  #      1 KB
AllocArray.test:·gc.alloc.rate  ss    9  2481.909 ± 35.566  MB/sec  #      4 KB
AllocArray.test:·gc.alloc.rate  ss    9  2608.336 ± 14.693  MB/sec  #     16 KB
AllocArray.test:·gc.alloc.rate  ss    9  2635.857 ±  8.229  MB/sec  #    256 KB
AllocArray.test:·gc.alloc.rate  ss    9  2627.845 ± 60.514  MB/sec  #   1024 KB

Notice how we are able to pull off 2.5 GB/sec allocation rate in a single thread. With 16 byte objects, this means 160 million objects per second. In multi-threaded workloads, the allocation rates may reach tens of gigabytes per second. Of course, once TLAB size gets smaller, both the allocation costs go up, and allocation rate goes down. Unfortunately, we cannot make TLABs lower than 1 KB, because Hotspot mechanics needs some space wasted there, but we can turn off TLAB machinery completely, to see the performance impact:

Benchmark                      Mode  Cnt     Score   Error    Units

# -XX:-UseTLAB
AllocArray.test                  ss    9  2784.988 ± 18.925   ms/op
AllocArray.test:·gc.alloc.rate   ss    9   580.533 ±  3.342  MB/sec

Whoa, the allocation rate goes down at least 5x, and time to execute goes up 10x! And this is not even starting to touch what a collector has to do when multiple threads are asking for memory (probably contended atomics), or if it needs to look up where to allocate the memory from (try to allocate fast from free lists!). For Epsilon, the allocation path in GC is a single compare-and-set — because it issues the memory blocks by pointer-bumps itself. If you add one additional thread — totaling just two running threads — without TLABs, everything goes downhill:

Benchmark                            Mode  Cnt           Score       Error   Units

# TLAB = 4M (default for Epsilon)
AllocArray.test                        ss    9         407.729 ±     7.672   ms/op
AllocArray.test:·gc.alloc.rate         ss    9        4190.670 ±    45.909  MB/sec

# -XX:-UseTLAB
AllocArray.test                        ss    9        8490.585 ±   410.518   ms/op
AllocArray.test:·gc.alloc.rate         ss    9         422.960 ±    19.320  MB/sec

This is 20x performance hit now. Project the slowness you would experience with larger thread counts!

Observations

TLABs are the workhorses of allocation machinery: they unload the concurrency bottlenecks of the allocators, provide a cheap allocation path, and improve performance all around. It is funny to consider that having TLABs is the way to experience more frequent GC pauses, just because the allocation is so damn cheap! In reverse, having no fast allocation path in any memory manager implementation for sure hides the memory reclamation performance problems. When comparing the memory managers, do understand both parts of the story, and how they relate to each other.