About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic errors, or duplicates. Use and/or trust this at your own risk.


Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net

Question

Have you ever encountered large int[] arrays that cannot be accounted for? Those that are seemingly allocated nowhere, but still consume heap? Those that have some garbage-looking data in them?

Theory

In GC theory, there is an important property that good collectors try to maintain: heap parsability, that is, shaping the heap in such a way that it can be parsed for objects, fields, etc. without complicated supporting metadata. In OpenJDK, for example, many introspection tasks walk the heap with a simple loop like this:

HeapWord* cur = heap_start;
while (cur < heap_used) {
  object o = (object)cur;
  do_object(o);
  cur = cur + o->size();
}

That’s it! If the heap is parsable, then we can assume there is a contiguous stream of objects from the start to the allocated end. This is not, strictly speaking, a required property, but it makes GC implementation, testing, and debugging much easier.

Enter the Thread Local Allocation Buffer (TLAB) machinery: now each thread has its own TLAB it can currently allocate into. From the GC perspective, this means the entire TLAB is claimed. GC cannot easily know what threads are up to in there: are they in the middle of bumping the TLAB cursor? What is the value of the TLAB cursor anyway? It is possible that a thread just keeps it somewhere in a register (in OpenJDK, it does not) and never shows it to external observers. So, there is a problem: outsiders do not know what exactly happens in TLABs.
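
To make this more concrete, here is a minimal sketch of what TLAB bump-pointer allocation looks like, in the same pseudocode spirit as the heap walk above. The field and helper names are made up for illustration; this is not the actual HotSpot code:

HeapWord* tlab_allocate(Thread* t, size_t size) {
  if (t->tlab_cur + size <= t->tlab_end) {
    HeapWord* obj = t->tlab_cur;       // new object starts at the cursor
    t->tlab_cur = t->tlab_cur + size;  // bump the cursor; outsiders never see this happen
    return obj;
  }
  return allocate_slow(t, size);       // refill the TLAB, or allocate outside of it
}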

We might want to stop the threads to avoid TLAB mutation, and then traverse the heap carefully, checking whether what we are walking right now is part of some TLAB. But there is a more convenient trick: why don’t we make the heap parsable by inserting filler objects? That is, if we have:

 ...........|===================           ]............
            ^                  ^           ^
        TLAB start        TLAB used   TLAB end

…​we can stop the threads, and ask them to allocate a dummy object in the rest of the TLAB to make their part of the heap parsable:

 ...........|===================!!!!!!!!!!!]............
            ^                  ^           ^
        TLAB start        TLAB used   TLAB end

What is a good candidate for a dummy object? Of course, something that has variable length. Why not an int[] array? Note that "putting" the object like this only amounts to writing out the array header, and letting the heap walking machinery work out the rest, jumping over its contents. Once the thread resumes allocating in the TLAB, it can just overwrite whatever filler we allocated, like nothing happened.
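
As a minimal sketch of what placing such a filler amounts to (the header size and helper names are assumed here for illustration, not the actual HotSpot code):

void fill_with_int_array(HeapWord* start, HeapWord* end) {
  size_t gap_bytes    = (end - start) * HeapWordSize;  // space we need to cover
  size_t header_bytes = INT_ARRAY_HEADER_BYTES;        // mark word, class pointer, length
  int length = (int)((gap_bytes - header_bytes) / sizeof(jint));
  // Write only the header; heap walkers compute the object size from the
  // length field and jump right over the (garbage) contents.
  write_int_array_header(start, length);
}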

The same thing, by the way, simplifies sweeping the heap. If we remove (sweep out) an object, it is convenient to put a filler in its place to keep the heap walking routines happy.

Experiment

Can we see it in action? Of course we can. What we want is to start lots of threads that claim some TLABs of their own, and one loner thread that will exhaust the Java heap, failing with OutOfMemoryError, which we will use as the trigger for a heap dump.

A workload like this is fine:

import java.util.*;
import java.util.concurrent.*;

public class Fillers {
  public static void main(String... args) throws Exception {
    final int TRAKTORISTOV = 300;
    CountDownLatch cdl = new CountDownLatch(TRAKTORISTOV);
    for (int t = 0 ; t < TRAKTORISTOV; t++) {
      new Thread(() -> allocateAndWait(cdl)).start();
    }
    cdl.await();
    List<Object> l = new ArrayList<>();
    new Thread(() -> allocateAndDie(l)).start();
  }

  public static void allocateAndWait(CountDownLatch cdl) {
    Object o = new Object();  // Request a TLAB
    cdl.countDown();
    while (true) {
      try {
        Thread.sleep(1000);
      } catch (Exception e) {
        break;
      }
    }
    System.out.println(o); // Use the object
  }

  public static void allocateAndDie(Collection<Object> c) {
    while (true) {
      c.add(new Object());
    }
  }
}

Now, in order to get predictable TLAB sizes, we can again use Epsilon GC. Running with -Xmx1G -Xms1G -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:+HeapDumpOnOutOfMemoryError quickly fails and produces the heap dump for us.
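
For example, the full command line could look like this (assuming the class above is compiled from Fillers.java):

$ javac Fillers.java
$ java -Xmx1G -Xms1G -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC \
       -XX:+HeapDumpOnOutOfMemoryError Fillers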

Opening this heap dump in Eclipse Memory Analyzer (MAT) — I like that tool a lot — we can see this class histogram:

Class Name                                 |   Objects | Shallow Heap |
-----------------------------------------------------------------------
                                           |           |              |
int[]                                      |     1,099 |  814,643,272 |
java.lang.Object                           | 9,181,912 |  146,910,592 |
java.lang.Object[]                         |     1,521 |  110,855,376 |
byte[]                                     |     6,928 |      348,896 |
java.lang.String                           |     5,840 |      140,160 |
java.util.HashMap$Node                     |     1,696 |       54,272 |
java.util.concurrent.ConcurrentHashMap$Node|     1,331 |       42,592 |
java.util.HashMap$Node[]                   |       413 |       42,032 |
char[]                                     |        50 |       37,432 |
-----------------------------------------------------------------------

See how int[] is the dominating heap consumer! These are our filler objects. Granted, this experiment has a few caveats.

First, we configured Epsilon to use static TLAB sizes. A high-performance collector would instead make adaptive TLAB sizing decisions, which minimize the heap slack when a thread has allocated only a few objects but still sits on a trove of TLAB memory. This is one of the reasons why you don’t want to issue large TLABs without thinking twice. Still, it is possible to observe filler objects when an actively allocating thread has a large TLAB issued to it and is only halfway through filling it up with real data.

Second, we have configured MAT to show us unreachable objects. These filler objects are, by definition, unreachable. Their presence in the heap dump is just a side effect of the heap dumper relying on the heap parsability property to walk the heap. These objects do not really exist, and a good heap dump analyzer will happily filter them out for you; this might be one of the reasons why a heap dump taken on a full 1 GB heap contains only, say, 900 MB worth of objects.

Observations

Having TLABs is fun. Having heap parsability is fun too. Combining both is even more fun, and sometimes it leaks the internal trickery out. If you see surprising behavior from any runtime, you might be looking at some clever trick!