About, Disclaimers, Contacts
"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep for only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing 'tyle, syntaxtic and semantically errors, duplicates, or also consistency. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net
Question
The Epsilon GC JEP mentions in its "Motivation" section, among other things:
"Last-drop performance improvements: … Even for non-allocating workloads, the choice of GC means choosing the set of GC barriers that the workload has to use, even if no GC cycle actually happens. Most OpenJDK GCs are generational, and they emit at least one reference write barrier. Avoiding this barrier brings the last bit of performance improvement."
What gives?
Theory
If a garbage collector wants to collect some part of the managed heap without touching the entire heap, it has to know which references point into that part. Otherwise, it cannot reliably tell what is reachable in that part, and has to conservatively assume everything is reachable… which puts it in a position where nothing can be treated as garbage, burning the whole idea to the ground. Second, the collector needs to know which locations to update if it ever moves objects in that part; this mostly concerns "outside" pointers that would not be visited while processing the collected part.
The simplest (and very effective) way to split the heap into parts is to segregate objects by age, that is, to introduce generations. The key idea here is the weak generational hypothesis, which claims that "new objects die young". The practical thing that lets us capitalize on this hypothesis is that in marking collectors, performance depends on the number of surviving objects. This means we can have a young generation, where most objects are dead, and process it quickly and maybe more frequently, while the old generation is kept aside.
However, there could be references from the old generation to the young one, and those have to be taken care of when collecting the young generation alone. Since the old generation is usually collected together with the entire heap, young→old references can be omitted from tracking. Ditto for young→young and old→old references, because their referents would get visited in the same collection anyway.
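For a concrete illustration, consider this small hypothetical program (not part of the original discussion), where exactly such an old→young reference appears:

public class OldToYoung {
    static class Holder {
        Object ref;
    }

    public static void main(String[] args) {
        Holder h = new Holder();       // allocated in the young generation
        for (int i = 0; i < 5; i++) {
            System.gc();               // surviving these collections promotes h to the old generation
        }
        h.ref = new byte[64];          // h is now old, the new byte[] is young:
                                       // this store creates an old→young reference,
                                       // which the write barrier records
        System.out.println(h.ref);     // keep h reachable
    }
}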
Taking Parallel GC from OpenJDK as the simplest example: it records old→young references with the help of a Card Table, a coarse bitmap that spans the old generation. When a reference store happens, the runtime flips a bit in that card table. That bit means the part of the old generation "under the card" can potentially hold pointers to the young generation, and needs to be scanned when the young generation is collected. For all this to work, reference stores are augmented with write barriers: little pieces of code that do the card table management.
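To make these mechanics concrete, here is a minimal Java model of such a card-marking barrier, under the assumptions just described (one card byte per 512 heap bytes, 0 meaning "dirty"). It is only a sketch: the real barrier is machine code emitted by the JIT compiler, as the disassembly later in this post shows.

class CardTableModel {
    static final int CARD_SHIFT = 9;    // log2(512): one card byte covers 512 heap bytes
    static final byte DIRTY = 0;        // "dirty" is encoded as zero, as in HotSpot

    final byte[] cards;                 // stands in for the native card table
    final long heapBase;                // lowest address of the managed heap

    CardTableModel(long heapBase, long heapSize) {
        this.heapBase = heapBase;
        this.cards = new byte[(int) (heapSize >>> CARD_SHIFT)];
    }

    // Conceptually runs after every reference store into the object at objAddr.
    // HotSpot avoids the subtraction by biasing the table base, so the emitted
    // code indexes with (objAddr >> 9) directly.
    void postReferenceStore(long objAddr) {
        cards[(int) ((objAddr - heapBase) >>> CARD_SHIFT)] = DIRTY;
    }
}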
Practice
Does that really matter in practice? It sometimes does! Take this workload as an example:
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3, jvmArgsAppend = {"-Xms4g", "-Xmx4g", "-Xmn2g"})
@Threads(Threads.MAX)
public class HashWalkBench {

    @Param({"16", "256", "4096", "65536", "1048576", "16777216"})
    int size;

    Map<String, String> map;

    @Setup
    public void setup() throws Exception {
        create();
        System.gc();
    }

    private void create() {
        String[] keys = new String[size];
        String[] values = new String[size];
        for (int c = 0; c < size; c++) {
            keys[c] = "Key" + c;
            values[c] = "Value" + c;
        }
        map = new HashMap<>(size);
        for (int c = 0; c < size; c++) {
            String key = keys[c];
            String val = values[c];
            map.put(key, val);
        }
    }

    @Benchmark
    public void walk(Blackhole bh) {
        for (Map.Entry<String, String> e : map.entrySet()) {
            bh.consume(e.getKey());
            bh.consume(e.getValue());
        }
    }
}
It creates a HashMap, requests a GC to let the map settle in the heap, and then walks it. Running it with Epsilon vs Parallel yields:
Benchmark (size) Mode Cnt Score Error Units
# Epsilon
HashWalkBench.walk 16 avgt 15 0.238 ± 0.005 us/op
HashWalkBench.walk 256 avgt 15 3.638 ± 0.072 us/op
HashWalkBench.walk 4096 avgt 15 59.222 ± 1.974 us/op
HashWalkBench.walk 65536 avgt 15 1102.590 ± 4.331 us/op
HashWalkBench.walk 1048576 avgt 15 19683.680 ± 195.086 us/op
HashWalkBench.walk 16777216 avgt 15 328319.596 ± 7137.066 us/op
# Parallel
HashWalkBench.walk 16 avgt 15 0.240 ± 0.001 us/op
HashWalkBench.walk 256 avgt 15 3.679 ± 0.078 us/op
HashWalkBench.walk 4096 avgt 15 64.778 ± 0.275 us/op
HashWalkBench.walk 65536 avgt 15 1377.634 ± 28.132 us/op
HashWalkBench.walk 1048576 avgt 15 25223.994 ± 853.601 us/op
HashWalkBench.walk 16777216 avgt 15 400679.042 ± 8155.414 us/op
Wait, but haven’t we talked about stores? Where is the store here? Apart from the infrastructure stores handled by JMH itself, the interesting thing happens under the for-each loop cover: it implicitly instantiates a HashMap.EntryIterator, which saves the current entry in its field. (It would have been scalarized, had escape analysis not been this fragile.)
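For illustration, here is roughly what the for-each loop in walk() desugars to. This is a sketch, not exact javac output, and it would also need an import of java.util.Iterator:

@Benchmark
public void walkDesugared(Blackhole bh) {
    Iterator<Map.Entry<String, String>> it = map.entrySet().iterator();
    while (it.hasNext()) {
        Map.Entry<String, String> e = it.next(); // next() saves the entry into an iterator field:
                                                 // this reference store is what gets the barrier
        bh.consume(e.getKey());
        bh.consume(e.getValue());
    }
}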
We can clearly see that store and the associated barrier in the generated code:
1.58% 0.91% ...a4e2c: mov %edi,0x18(%r9) ; %r9 = iterator, 0x18(%r9) = field
0.27% 0.33% ...a4e30: mov %r9,%r11 ; r11 = &iterator
0.26% 0.38% ...a4e33: shr $0x9,%r11 ; r11 = r11 >> 9
0.13% 0.20% ...a4e37: movabs $0x7f2c535aa000,%rbx ; rbx = card table base
0.58% 0.57% ...a4e41: mov %r12b,(%rbx,%r11,1) ; put 0 to (rbx+r11)
There are a few observations here:
- The card mark sets the entire byte, not just a bit. This avoids synchronization: most hardware can perform a byte store atomically, without touching the data around it. This makes the card table beefier than it theoretically could be, but still quite dense: 1 byte of card table per 512 bytes of heap (notice the shift by 9).
- The barrier happens unconditionally, even though we only need to record old→young references. This is practical: we avoid excess branches on the hot path, and in exchange accept a card table that spans the entire heap, not only the old generation. Given how dense the card table is, that adds only a tiny fraction of extra footprint. It also makes it easy to move the boundary between the young and old generations in collectors that tune themselves, without having to patch the generated code. (For contrast, a hypothetical conditional variant is sketched after this list.)
- The card table address is hardcoded into the generated code, which is practical again, because the table is a native, unmovable structure. This saves memory loads: otherwise we would have to fetch the card table address from somewhere on every store.
- The card mark "set" is actually encoded as "0". This is again practical, because we can then reuse zero registers (especially on architectures that have them explicitly) as the source operand. It does not matter whether "dirty" is encoded as 0 or 1, as long as the native GC code agrees on the same value.
This performance picture is further corroborated by hardware counters (normalized per benchmark operation, and then divided by 1M, which for size = 1048576 gives roughly per-element costs):
Benchmark (size) Mode Cnt Score Error Units
# Epsilon
HashWalkBench.walk 1048576 avgt 15 0.019 ± 0.001 us/op
HashWalkBench.walk:L1-dcache-load-misses 1048576 avgt 3 0.389 ± 0.495 #/op
HashWalkBench.walk:L1-dcache-loads 1048576 avgt 3 25.439 ± 2.411 #/op
HashWalkBench.walk:L1-dcache-stores 1048576 avgt 3 20.090 ± 1.184 #/op
HashWalkBench.walk:cycles 1048576 avgt 3 75.230 ± 11.333 #/op
HashWalkBench.walk:instructions 1048576 avgt 3 90.075 ± 10.484 #/op
# Parallel
HashWalkBench.walk 1048576 avgt 15 0.024 ± 0.001 us/op
HashWalkBench.walk:L1-dcache-load-misses 1048576 avgt 3 1.156 ± 0.360 #/op
HashWalkBench.walk:L1-dcache-loads 1048576 avgt 3 25.417 ± 1.711 #/op
HashWalkBench.walk:L1-dcache-stores 1048576 avgt 3 23.265 ± 3.552 #/op
HashWalkBench.walk:cycles 1048576 avgt 3 97.435 ± 69.688 #/op
HashWalkBench.walk:instructions 1048576 avgt 3 102.477 ± 12.689 #/op
So, Parallel does about the same number of loads (thanks to the hardcoded card table pointer), but it makes around 3 additional stores per element, which cost around 22 additional cycles and 12 additional instructions. This addition seems to correspond to three write barriers per element in this particular workload. Notice that L1 cache misses are ever so slightly higher too, because card mark stores pollute the cache, reducing its effective capacity for the application.
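For reference, these numbers are just the differences between the two runs in the counter table above:

extra stores:       23.265 − 20.090 ≈ 3.2 per element
extra cycles:       97.435 − 75.230 ≈ 22.2 per element
extra instructions: 102.477 − 90.075 ≈ 12.4 per element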
Observations
GCs usually come with a set of barriers that affect application performance even when no actual GC work is happening. Even in the case of a very basic generational collector like Serial or Parallel, there is at least one reference store barrier that has to record inter-generational references. More advanced collectors, like G1, have even more sophisticated barriers that track references between regions. In some cases, that cost is painful enough to warrant tricks for avoiding it, including using a no-op GC like Epsilon.