JVM Anatomy Quark #13: Intergenerational Barriers

About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep for only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing 'tyle, syntaxtic and semantically errors, duplicates, or also consistency. Use and/or trust this at your own risk.

Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net

Question

Epsilon GC JEP mentions in its "Motivation" section, among other things:

"Last-drop performance improvements: … Even for non-allocating workloads, the choice of GC means choosing the set of GC barriers that the workload has to use, even if no GC cycle actually happens. Most OpenJDK GCs are generational, and they emit at least one reference write barrier. Avoiding this barrier brings the last bit of performance improvement."

What gives?

Theory

If any garbage collector wants to collect parts of managed heap without touching the entire heap, then it has to understand what references are pointing into that collected part. Otherwise, it cannot reliably tell what is reachable in that collected part, thus having to conservatively assume everything is reachable… and that puts it in position where nothing can be treated as garbage, burning the idea to the ground. Second, it wants to understand which locations to update, if it ever moves any objects in that collected part — this concerns mostly "outside" pointers that would not be visited when processing the collected part.

The simplest (and very effective) way to split the heap in parts is to segregate objects by age, that is, introduce generations. The key idea here is weak generational hypothesis that claims "new objects die young". The practical thing that lets capitalize on weak generational hypothesis is that in marking collectors, the performance is dependent on the number of surviving objects. Which means, we can have a young generation, where everything is mostly dead, process it quickly and maybe more frequently, while old generation is kept aside.

However, there could be references from old to young generations that you need to take care of when collecting young generation alone. Old generation is usually collected along with the entire heap, young→old references can be omitted from tracking. Ditto for young→young and old→old references, because their referees would get visited in the same collection.

gc partial parallel

Taking Parallel GC from OpenJDK as the simplest example, it records the old→young references with the help of Card Table: the coarse bitmap that spans the old generation. When store happens, it needs to flip a bit in that card table. That bit would mean that the part of old generation "under the card" can potentially have pointers to young gen, and it needs to be scanned when young gen is collected. For all this to work, reference stores have to be augmented with write barriers, little pieces of code that do card table management.

Practice

Does that really matter in practice? It sometimes does! Take this workload as example:

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3, jvmArgsAppend = {"-Xms4g", "-Xmx4g", "-Xmn2g"})
@Threads(Threads.MAX)
public class HashWalkBench {

    @Param({"16", "256", "4096", "65536", "1048576", "16777216"})
    int size;

    Map<String, String> map;

    @Setup
    public void setup() throws Exception {
        create();
        System.gc();
    }

    private void create() {
        String[] keys = new String[size];
        String[] values = new String[size];
        for (int c = 0; c < size; c++) {
            keys[c] = "Key" + c;
            values[c] = "Value" + c;
        }
        map = new HashMap<>(size);
        for (int c = 0; c < size; c++) {
            String key = keys[c];
            String val = values[c];
            map.put(key, val);
        }
    }

    @Benchmark
    public void walk(Blackhole bh) {
        for (Map.Entry<String, String> e : map.entrySet()) {
            bh.consume(e.getKey());
            bh.consume(e.getValue());
        }
    }
}

It creates a HashMap, performs a few rounds of GC, and then walks it. Running it with Epsilon vs Parallel will yield:

Benchmark             (size)  Mode  Cnt       Score      Error  Units

# Epsilon
HashWalkBench.walk        16  avgt   15       0.238 ±    0.005  us/op
HashWalkBench.walk       256  avgt   15       3.638 ±    0.072  us/op
HashWalkBench.walk      4096  avgt   15      59.222 ±    1.974  us/op
HashWalkBench.walk     65536  avgt   15    1102.590 ±    4.331  us/op
HashWalkBench.walk   1048576  avgt   15   19683.680 ±  195.086  us/op
HashWalkBench.walk  16777216  avgt   15  328319.596 ± 7137.066  us/op

# Parallel
HashWalkBench.walk        16  avgt   15       0.240 ±    0.001  us/op
HashWalkBench.walk       256  avgt   15       3.679 ±    0.078  us/op
HashWalkBench.walk      4096  avgt   15      64.778 ±    0.275  us/op
HashWalkBench.walk     65536  avgt   15    1377.634 ±   28.132  us/op
HashWalkBench.walk   1048576  avgt   15   25223.994 ±  853.601  us/op
HashWalkBench.walk  16777216  avgt   15  400679.042 ± 8155.414  us/op

Wait, but haven’t we talked about the stores? Where is the store here? Apart from the infrastructure stores handled by JMH itself, the interesting thing happens behind the for-each loop cover. It implicitly instantiates HashMap.EntrySetIterator, which saves the current entry in its field. (It would have been scalarized, if escape analysis was not this fragile).

We can clearly see that store and the associated barrier in the generated code:

1.58%    0.91%   ...a4e2c: mov    %edi,0x18(%r9)       ; %r9 = iterator, 0x18(%r9) = field
0.27%    0.33%   ...a4e30: mov    %r9,%r11             ; r11 = &iterator
0.26%    0.38%   ...a4e33: shr    $0x9,%r11            ; r11 = r11 >> 9
0.13%    0.20%   ...a4e37: movabs $0x7f2c535aa000,%rbx ; rbx = card table base
0.58%    0.57%   ...a4e41: mov    %r12b,(%rbx,%r11,1)  ; put 0 to (rbx+r11)

There are a few observations here:

The card mark sets the entire byte, not just a bit. This avoids synchronization: most hardware can perform the byte store atomically, without touching the data around it. This makes card table beefier than it could theoretically be, but still quite dense: 1 byte of card table per 512 bytes of heap, notice the shift by 9.
The barrier happens unconditionally, while we only need to record old→young references. This seems practical: we don’t have excess branches on hot path, and we trade card table that spans the entire heap, not only the old. Given how dense the card table is, we would introduce only a tiny fraction of additional footprint. This also helps to move the ephemeral boundary between young and old generations in collectors that can tune themselves, without having to patch the code.
The card table address is encoded in the generated code, which is practical again, because it is a native unmovable structure. That saves memory loads, because we would have to poll the card table address from somewhere otherwise.
The card mark "set" is actually encoded as "0". This is again practical, because we ca then reuse the zero registers — especially on architectures that have them explicitly — to get the source operand. It does not matter what value to use for card table initialization, 0 or 1, later in native GC code.

This performance picture is further corroborated by hardware counters (normalized to operation, and then divided by 1M):

Benchmark                                  (size)  Mode  Cnt    Score    Error  Units

# Epsilon
HashWalkBench.walk                        1048576  avgt   15    0.019 ±  0.001  us/op
HashWalkBench.walk:L1-dcache-load-misses  1048576  avgt    3    0.389 ±  0.495   #/op
HashWalkBench.walk:L1-dcache-loads        1048576  avgt    3   25.439 ±  2.411   #/op
HashWalkBench.walk:L1-dcache-stores       1048576  avgt    3   20.090 ±  1.184   #/op
HashWalkBench.walk:cycles                 1048576  avgt    3   75.230 ± 11.333   #/op
HashWalkBench.walk:instructions           1048576  avgt    3   90.075 ± 10.484   #/op

# Parallel
HashWalkBench.walk                        1048576  avgt   15    0.024 ±  0.001  us/op
HashWalkBench.walk:L1-dcache-load-misses  1048576  avgt    3    1.156 ±  0.360   #/op
HashWalkBench.walk:L1-dcache-loads        1048576  avgt    3   25.417 ±  1.711   #/op
HashWalkBench.walk:L1-dcache-stores       1048576  avgt    3   23.265 ±  3.552   #/op
HashWalkBench.walk:cycles                 1048576  avgt    3   97.435 ± 69.688   #/op
HashWalkBench.walk:instructions           1048576  avgt    3  102.477 ± 12.689   #/op

So, parallel does the same amount of loads (thanks to encoded card table pointer), and it makes 3 additional stores, that cost around 22 additional cycles and 12 instructions. This addition seems to correspond to three write barriers in this particular workload. Notice that L1 cache misses are ever so slightly greater too: because card mark stores pollute it, reducing the effective cache capacity for the application.

Observations

GCs are usually coming with the set of barriers that affect application performance even when no actual GC work is happening. Even in the case of very basic generational collector like Serial/Parallel, you have at least one reference store barrier that has to record inter-generational barriers. More advanced collectors like G1 have even more sophisticated barriers that track references between the regions. In some cases, that cost is painful to warrant tricks to avoid it, including the no-op GC, like Epsilon.