JVM Anatomy Quark #3: GC Design and Pauses

About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep for only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing 'tyle, syntaxtic and semantically errors, duplicates, or also consistency. Use and/or trust this at your own risk.

Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net

Question

Garbage Collection is the enemy. But I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration… wait, what was the question again? The actual question was about to discuss the claim that "Allocating 100M objects in array list is enough to show how Java hiccups for seconds". Is that true?

Landscape

It is easy to scapegoat generic GC as performance hog, while the problem lies in the GC implementations that do not perform up to your expectations on your workloads. In many cases, those workloads are themselves problematic, but in many cases workloads are running with unsuitable GCs! Let’s see what is there in OpenJDK landscape GC-wise:

Figure 1. OpenJDK GC landscape. Yellow are stop-the-world phases, green are concurrent phases.

Notice how most collectors have pauses in their regular GC cycles.

Experiment

While rejecting the "100M Objects into ArrayList" test as unrealistic is fun, we can still run it and see how it performs. Quickly hacking it together:

import java.util.*;

public class AL {
    static List<Object> l;
    public static void main(String... args) {
        l = new ArrayList<>();
        for (int c = 0; c < 100_000_000; c++) {
            l.add(new Object());
        }
    }
}

A little bit of cow wisdom about this:

$ cowsay ...
 ________________________________________
/ This is a crappy GC benchmark, by the  \
| way. I am a cow, and even I understand |
\ this.                                  /
 ----------------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Still, even a crappy benchmark tells you something about the system under test. You have to only be careful to understand what it tells you. Turns out, the workload like above highlights the GC design choices of different collectors in OpenJDK.

Let’s run with latest JDK 9 + Shenandoah forest to get all the latest GC implementation improvements. For a change, do this on a low-level 1.7 GHz i5 ultrabook with Linux x86_64. Since we are about to allocate 100M 16-byte objects, setting up the heap to static 4 GB seems fine, and eliminates some degrees of freedom between collectors.

G1 (default in JDK 9)

$ time java -Xms4G -Xmx4G -Xlog:gc AL
[0.030s][info][gc] Using G1
[1.525s][info][gc] GC(0) Pause Young (G1 Evacuation Pause) 370M->367M(4096M) 991.610ms
[2.808s][info][gc] GC(1) Pause Young (G1 Evacuation Pause) 745M->747M(4096M) 928.510ms
[3.918s][info][gc] GC(2) Pause Young (G1 Evacuation Pause) 1105M->1107M(4096M) 764.967ms
[5.061s][info][gc] GC(3) Pause Young (G1 Evacuation Pause) 1553M->1555M(4096M) 601.680ms
[5.835s][info][gc] GC(4) Pause Young (G1 Evacuation Pause) 1733M->1735M(4096M) 465.216ms
[6.459s][info][gc] GC(5) Pause Initial Mark (G1 Humongous Allocation) 1894M->1897M(4096M) 398.453ms
[6.459s][info][gc] GC(6) Concurrent Cycle
[7.790s][info][gc] GC(7) Pause Young (G1 Evacuation Pause) 2477M->2478M(4096M) 472.079ms
[8.524s][info][gc] GC(8) Pause Young (G1 Evacuation Pause) 2656M->2659M(4096M) 434.435ms
[11.104s][info][gc] GC(6) Pause Remark 2761M->2761M(4096M) 1.020ms
[11.979s][info][gc] GC(6) Pause Cleanup 2761M->2215M(4096M) 2.446ms
[11.988s][info][gc] GC(6) Concurrent Cycle 5529.427ms

real  0m12.016s
user  0m34.588s
sys   0m0.964s

What do we see with G1? Young pauses that are in in 500..1000 ms range. These pauses are likely to be less once we reach steady state and heuristics figure out how much to collect to keep pause times on target. After a while, concurrent GC cycle starts, and continues until almost to the end. (Notice how Young collections overlap with concurrent phases too). It should have been followed by "mixed" collection pauses, but VM had exited before that. These non-steady-state pauses really contributed to the long run times for this one-off job.

Also, notice how "user" time is larger than "real" (wallclock) time. This is because GC work is parallel, and so while the application is running in a single thread, GCs are using all available parallelism to make the collections faster wallclock-wise.

Parallel

$ time java -XX:+UseParallelOldGC -Xms4G -Xmx4G -Xlog:gc AL
[0.023s][info][gc] Using Parallel
[1.579s][info][gc] GC(0) Pause Young (Allocation Failure) 878M->714M(3925M) 1144.518ms
[3.619s][info][gc] GC(1) Pause Young (Allocation Failure) 1738M->1442M(3925M) 1739.009ms

real  0m3.882s
user  0m11.032s
sys   0m1.516s

With Parallel, we see similar Young pauses, which also probably resized the Eden/Survivors just enough to accept more temporary allocations. Therefore, we have only two large pauses, and the workload finishes quickly. In steady state, this collector would probably keep the same large pauses. "user" >> "real" as well, so some overhead is hiding in concurrent GC work here.

Concurrent Mark Sweep

$ time java -XX:+UseConcMarkSweepGC -Xms4G -Xmx4G -Xlog:gc AL
[0.012s][info][gc] Using Concurrent Mark Sweep
[1.984s][info][gc] GC(0) Pause Young (Allocation Failure) 259M->231M(4062M) 1788.983ms
[2.938s][info][gc] GC(1) Pause Young (Allocation Failure) 497M->511M(4062M) 871.435ms
[3.970s][info][gc] GC(2) Pause Young (Allocation Failure) 777M->850M(4062M) 949.590ms
[4.779s][info][gc] GC(3) Pause Young (Allocation Failure) 1117M->1161M(4062M) 732.888ms
[6.604s][info][gc] GC(4) Pause Young (Allocation Failure) 1694M->1964M(4062M) 1662.255ms
[6.619s][info][gc] GC(5) Pause Initial Mark 1969M->1969M(4062M) 14.831ms
[6.619s][info][gc] GC(5) Concurrent Mark
[8.373s][info][gc] GC(6) Pause Young (Allocation Failure) 2230M->2365M(4062M) 1656.866ms
[10.397s][info][gc] GC(7) Pause Young (Allocation Failure) 3032M->3167M(4062M) 1761.868ms
[16.323s][info][gc] GC(5) Concurrent Mark 9704.075ms
[16.323s][info][gc] GC(5) Concurrent Preclean
[16.365s][info][gc] GC(5) Concurrent Preclean 41.998ms
[16.365s][info][gc] GC(5) Concurrent Abortable Preclean
[16.365s][info][gc] GC(5) Concurrent Abortable Preclean 0.022ms
[16.478s][info][gc] GC(5) Pause Remark 3390M->3390M(4062M) 113.598ms
[16.479s][info][gc] GC(5) Concurrent Sweep
[17.696s][info][gc] GC(5) Concurrent Sweep 1217.415ms
[17.696s][info][gc] GC(5) Concurrent Reset
[17.701s][info][gc] GC(5) Concurrent Reset 5.439ms

real  0m17.719s
user  0m45.692s
sys   0m0.588s

Contrary to popular belief, in CMS, "Concurrent" means the concurrent collections in old generation. The young collections are still stopping the world, as we can see here. From the GC log standpoint, the phasing looks like G1: young pauses, concurrent cycle. The difference is that "Concurrent Sweep" cleans up old without stopping the application, in contrast to G1 Mixed pauses. Anyhow, the longer Young GC pauses without heuristics able to catch up with them yet defines the performance on this quick job.

Shenandoah

$ time java -XX:+UseShenandoahGC -Xms4G -Xmx4G -Xlog:gc AL
[0.026s][info][gc] Using Shenandoah
[0.808s][info][gc] GC(0) Pause Init Mark 0.839ms
[1.883s][info][gc] GC(0) Concurrent marking 2076M->3326M(4096M) 1074.924ms
[1.893s][info][gc] GC(0) Pause Final Mark 3326M->2784M(4096M) 10.240ms
[1.894s][info][gc] GC(0) Concurrent evacuation  2786M->2792M(4096M) 0.759ms
[1.894s][info][gc] GC(0) Concurrent reset bitmaps 0.153ms
[1.895s][info][gc] GC(1) Pause Init Mark 0.920ms
[1.998s][info][gc] Cancelling concurrent GC: Stopping VM
[2.000s][info][gc] GC(1) Concurrent marking 2794M->2982M(4096M) 104.697ms

real  0m2.021s
user  0m5.172s
sys   0m0.420s

In Shenandoah, there are no young collections. (At least today. There are some ideas how to get quick partial collections without introducing generations — but they are unlikely to be stop-the-world). The concurrent GC cycle starts and runs along with application, stopping it with two minor pauses to initiate and finish the concurrent mark. Concurrent copying takes nothing, because everything is alive and not yet fragmented. The second GC cycle terminates early due to VM shutdown. The absence of pauses like in other collectors explains why the workload can finish quickly.

Epsilon

$ time java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xms4G -Xmx4G  -Xlog:gc AL
[0.031s][info][gc] Initialized with 4096M non-resizable heap.
[0.031s][info][gc] Using Epsilon GC
[1.361s][info][gc] Total allocated: 2834042 KB.
[1.361s][info][gc] Average allocation rate: 2081990 KB/sec

real  0m1.415s
user  0m1.240s
sys   0m0.304s

Running with experimental "no-op" Epsilon GC can help to estimate GC overheads when no collector is running at all. Here, we can fit exactly in 4 GB pre-sized heap, and application runs with no pauses whatsoever. It would not survive anything more actively mutating the heap, though. Notice that "real" and "user"+"sys" times are almost equal, which corroborates the theory there was a single application thread only.

Observations

Different GCs have different tradeoffs in their implementations. Brushing the GC off as "bad idea" is a stretch. Choose a collector for your kind of workload, by understanding your workload, available GC implementations, and your performance requirements. Even if you choose to target platforms without GCs, you would still need to know (and choose!) your native memory allocators. When running experimental workloads, try to understand what they tell you, and learn from that. Peace.