JVM Anatomy Quark #11: Moving GC and Locality

About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep for only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing 'tyle, syntaxtic and semantically errors, duplicates, or also consistency. Use and/or trust this at your own risk.

Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net

Question

So I have heard non-moving garbage collectors are okay, because $reasons. Slower allocations and fragmentation do not concern me. Are there other implications?

Theory

If you open GC Handbook, there is an a small section on "Is compaction necessary?". The last point they make is:

Mark-compact collectors may preserve the allocation order of objects in the heap or they may rearrange them arbitrarily. Although arbitrary order collectors may be faster than other mark-compact collectors and impose no space overheads, the mutator’s locality likely to suffer from an arbitrary scrambling of object order.

— GC Handbook
3.5. Issues to consider. Locality

Does that really matter?

Experiment

Once again, we can construct the simple experiment to see how that works. The simplest test case we can come up with in the array of references, that is either shuffled or not, and walk its contents. In JMH speak, something like this:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 1, jvmArgsAppend = {"-Xms8g", "-Xmx8g", "-Xmn7g" })
public class ArrayWalkBench {

    @Param({"16", "256", "4096", "65536", "1048576", "16777216"})
    int size;

    @Param({"false", "true"})
    boolean shuffle;

    @Param({"false", "true"})
    boolean gc;

    String[] arr;

    @Setup
    public void setup() throws IOException, InterruptedException {
        arr = new String[size];
        for (int c = 0; c < size; c++) {
            arr[c] = "Value" + c;
        }
        if (shuffle) {
            Collections.shuffle(Arrays.asList(arr), new Random(0xBAD_BEE));
        }
        if (gc) {
            for (int c = 0; c < 5; c++) {
                System.gc();
                TimeUnit.SECONDS.sleep(1);
            }
        }
    }

    @Benchmark
    public void walk(Blackhole bh) {
        for (String val : arr) {
            bh.consume(val.hashCode());
        }
    }
}

Note the experiment has three degrees of freedom:

size of the array in question — it is always a good idea to try several sizes when you want to implicate different caches in the memory hierarchy, and afraid to put the benchmark into accidental sweet spot.
shuffle, that tells if we shuffle the array before walking it. Enabling shuffling simulates the case when insertions are not in order and/or happen over time to different indexes.
gc, that tells to force GC after preparing the data set. Since workload is not allocating in the payload code, we need to force GC explicitly, otherwise it would never run.

What would change with different settings? Let us take -XX:+UseParallelGC and see:

Benchmark             (gc)  (shuffle)    (size)  Mode  Cnt       Score      Error  Units

ArrayWalkBench.walk  false      false        16  avgt    5       0.051 ±    0.001  us/op
ArrayWalkBench.walk  false      false       256  avgt    5       0.821 ±    0.001  us/op
ArrayWalkBench.walk  false      false      4096  avgt    5      14.516 ±    0.026  us/op
ArrayWalkBench.walk  false      false     65536  avgt    5     307.210 ±    4.789  us/op
ArrayWalkBench.walk  false      false   1048576  avgt    5    4306.660 ±    7.950  us/op
ArrayWalkBench.walk  false      false  16777216  avgt    5   60561.476 ±  925.685  us/op

ArrayWalkBench.walk  false       true        16  avgt    5       0.051 ±    0.001  us/op
ArrayWalkBench.walk  false       true       256  avgt    5       0.823 ±    0.003  us/op
ArrayWalkBench.walk  false       true      4096  avgt    5      18.646 ±    0.044  us/op
ArrayWalkBench.walk  false       true     65536  avgt    5     461.187 ±   31.183  us/op
ArrayWalkBench.walk  false       true   1048576  avgt    5   16350.706 ±   75.757  us/op
ArrayWalkBench.walk  false       true  16777216  avgt    5  312296.960 ±  632.552  us/op

ArrayWalkBench.walk   true      false        16  avgt    5       0.051 ±    0.001  us/op
ArrayWalkBench.walk   true      false       256  avgt    5       0.820 ±    0.004  us/op
ArrayWalkBench.walk   true      false      4096  avgt    5      13.639 ±    0.063  us/op
ArrayWalkBench.walk   true      false     65536  avgt    5     174.475 ±    0.771  us/op
ArrayWalkBench.walk   true      false   1048576  avgt    5    4345.980 ±   15.230  us/op
ArrayWalkBench.walk   true      false  16777216  avgt    5   68687.192 ± 1375.171  us/op

ArrayWalkBench.walk   true       true        16  avgt    5       0.051 ±    0.001  us/op
ArrayWalkBench.walk   true       true       256  avgt    5       0.828 ±    0.010  us/op
ArrayWalkBench.walk   true       true      4096  avgt    5      13.749 ±    0.088  us/op
ArrayWalkBench.walk   true       true     65536  avgt    5     174.230 ±    0.655  us/op
ArrayWalkBench.walk   true       true   1048576  avgt    5    4365.162 ±   88.927  us/op
ArrayWalkBench.walk   true       true  16777216  avgt    5   70631.288 ± 1144.980  us/op

What do we see here?

We see that walking the shuffled array is indeed much, much slower than the initial in-order array — around 4x times slower! So, here is the hint: memory layout of object graph matters! You can control this in the code in some way during the initial load, but not when external clients put something at (possibly random) indexes.

We also see that after GC happened, both cases improved, because GC had compacted the space taken by the array out of temporary objects, if any, plus it moved the objects in memory so that they laid out in memory in array order. The difference between shuffled and non-shuffled version is basically gone. Therefore, here is another hint: not only GCs introduce overheads in your application, but they also pay back by helping to re-arrange objects to benefit locality, like this:

gc locality improv

If you only have a non-moving collector, you pay the price of GC without one of its major benefits! Indeed, this is one of the reasons why no-op GC like Epsilon may run application slower than a compacting GC. This is Epsilon running the same workload (gc = true is not applicable to it):

Benchmark             (gc)  (shuffle)    (size)  Mode  Cnt       Score      Error  Units

ArrayWalkBench.walk  false      false        16  avgt    5       0.051 ±    0.001  us/op
ArrayWalkBench.walk  false      false       256  avgt    5       0.826 ±    0.006  us/op
ArrayWalkBench.walk  false      false      4096  avgt    5      14.556 ±    0.049  us/op
ArrayWalkBench.walk  false      false     65536  avgt    5     274.073 ±   37.163  us/op
ArrayWalkBench.walk  false      false   1048576  avgt    5    4383.223 ±  997.953  us/op
ArrayWalkBench.walk  false      false  16777216  avgt    5   60322.171 ±  266.683  us/op

ArrayWalkBench.walk  false       true        16  avgt    5       0.051 ±    0.001  us/op
ArrayWalkBench.walk  false       true       256  avgt    5       0.826 ±    0.007  us/op
ArrayWalkBench.walk  false       true      4096  avgt    5      18.169 ±    0.165  us/op
ArrayWalkBench.walk  false       true     65536  avgt    5     312.345 ±   26.149  us/op
ArrayWalkBench.walk  false       true   1048576  avgt    5   16445.739 ±   54.241  us/op
ArrayWalkBench.walk  false       true  16777216  avgt    5  311573.643 ± 3650.280  us/op

Yes, you read that right, Epsilon runs slower than Parallel. Being a no-op GC, it does not incur GC overheads, but it also does not bring any benefits.

The cause for performance difference is very simple, and visible with -prof perfnorm (we also use -opi 1048576 to divide by number of elements):

Benchmark                    (gc)  (shuffle)   (size)  Mode  Cnt   Score    Error  Units

walk                         true       true  1048576  avgt   25   4.247 ±  0.090  ns/op
walk:CPI                     true       true  1048576  avgt    5   0.498 ±  0.050   #/op
walk:L1-dcache-load-misses   true       true  1048576  avgt    5   0.955 ±  0.025   #/op
walk:L1-dcache-loads         true       true  1048576  avgt    5  12.149 ±  0.432   #/op
walk:L1-dcache-stores        true       true  1048576  avgt    5   7.027 ±  0.176   #/op
walk:LLC-load-misses         true       true  1048576  avgt    5   0.156 ±  0.029   #/op
walk:LLC-loads               true       true  1048576  avgt    5   0.514 ±  0.014   #/op
walk:cycles                  true       true  1048576  avgt    5  17.056 ±  1.673   #/op
walk:instructions            true       true  1048576  avgt    5  34.223 ±  0.860   #/op

walk                        false       true  1048576  avgt   25  16.155 ±  0.101  ns/op
walk:CPI                    false       true  1048576  avgt    5   1.885 ±  0.069   #/op
walk:L1-dcache-load-misses  false       true  1048576  avgt    5   1.922 ±  0.076   #/op
walk:L1-dcache-loads        false       true  1048576  avgt    5  12.128 ±  0.326   #/op
walk:L1-dcache-stores       false       true  1048576  avgt    5   7.032 ±  0.212   #/op
walk:LLC-load-misses        false       true  1048576  avgt    5   1.031 ±  0.032   #/op
walk:LLC-loads              false       true  1048576  avgt    5   1.365 ±  0.101   #/op
walk:cycles                 false       true  1048576  avgt    5  64.700 ±  2.613   #/op
walk:instructions           false       true  1048576  avgt    5  34.335 ±  1.564   #/op

With shuffled version, you have around 2 clocks per instruction, and almost every last-level cache load is a miss. No wonder it runs slower: random walks over memory would cost you a lot.

There are also interesting visualizations for moving GCs in JOL Samples, under JOLSample_22_Compaction, JOLSample_23_Defragmentation, and JOLSample_24_Colocation.

Observations

The irony of all GC discussions is that having GC is sometimes painful, but having no GC is sometimes painful too!

It is very easy to underestimate the locality implications of having the non-moving GC.

I think one of the reasons why CMS works fine, while not relocating the objects in "tenured" generation, is that it has the copying "young" collection that at least makes attempt to compact before committing to particular object order in "tenured". STW collectors like Serial and Parallel(Old) reap the benefits of this for almost every collection. Regionalized collectors like G1 and Shenandoah can, should, and will exploit this too — although substantially more work is needed there because heap traversals are decoupled from evacuation. It would be audacious to claim locality does not matter. Enter NUMA, where locality penalties skyrocket, and be prepared to get royally screwed.

Note this locality property is about the object graphs, not the object layout itself. Even if a language provides the capabilities for controlling the memory layout of objects, that in all cases that I am aware of, cares about the object interiors (or, at most array of structures -like ensembles of objects), but not the arbitrary object graphs. Once you have put the regular objects in the particular places in memory — for example, not the dense array, but linked list, linked queue, concurrent skiplist, chained hashtable, what have you — you are stuck with the object graph linearized in memory in that particular way, unless you have a moving memory manager.

Also note that this locality property is dynamic — that is, it is dependent on what is actually going on in a particular application session, because applications change the object graph when running. You can teach your application to react to this appropriately by cleverly relocating its own data structures, but then you will find yourself implementing the moving automatic memory manager — or, a moving GC.

It also has nothing to do with the allocation rates — notice that the workload in almost purely static in the example above — and this is usually the case for many real life collections, especially large ones, representing the in-memory chunks of moderately frequently changed data. In this overwhelmingly frequent case it makes sense to let GC adapt to new layout, and then run with it for a while, until it changes again. Hoping that application would put them in proper order to benefit locality by itself is a wishful thinking, unless carefully designed in such a way.

Be ever so slightly wary when someone sells you the non-moving GC or no-GC solutions, and telling everything is going to be fine. Because they probably shift the locality problems (if not all other memory management problems) to your code to handle. There is no solution without (not so) hidden costs. Maybe it would be fine, the benefits would outweigh these known costs? Or maybe you just oblivious of the costs, and sellers would diplomatically avoid the topic?