About, Disclaimers, Contacts
"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic mistakes, or duplicates. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net
Theory
Virtual memory is taken for granted now. Only a few now remember, let alone do, some "real mode" programming that exposes the actual physical memory. Instead, every process has its own virtual memory space, and that space is mapped onto actual memory. This allows, for instance, two processes to keep distinct data at the same virtual address 0x42424242, backed by different physical memory. Now, when a program accesses that address, something has to translate the virtual address to a physical one.
This is normally achieved by the OS maintaining the "page table", and the hardware doing the "page table walk" through that table to translate the address. The whole thing gets easier when translations are maintained at page granularity. But it is nevertheless not very cheap, and it needs to happen for every memory access! Therefore, there is also a small cache of the latest translations, the Translation Lookaside Buffer (TLB). The TLB is usually very small, below 100 entries, because it needs to be at least as fast as the L1 cache, if not faster. For many workloads, TLB misses and the associated page table walks take significant time.
Since we cannot make the TLB larger, we can do something else: make the pages larger! Most hardware has 4K basic pages, and 2M/4M/1G "large pages". Having larger pages cover the same region also makes the page tables themselves smaller, which makes the cost of the page table walk lower.
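To get a feel for the numbers, take a hypothetical 64-entry data TLB (the exact size is machine-specific): with 4K pages it covers only 64 × 4K = 256K of memory without a miss, while with 2M pages the same 64 entries cover 64 × 2M = 128M.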
In the Linux world, there are at least two distinct ways to get this in applications:
- hugetlbfs. Cut out a part of system memory, expose it as a virtual filesystem, and let applications mmap(2) from it. This is a peculiar interface that requires both OS configuration and application changes to use. It is also an "all or nothing" kind of deal: the space allocated for (the persistent part of) hugetlbfs cannot be used by regular processes.
- Transparent Huge Pages (THP). Let the application allocate memory as usual, but try to provide large-pages-backed storage transparently to the application. Ideally, no application changes are needed, but we will see how applications can benefit from knowing THP is available. In practice, though, there are either memory overheads (because an entire large page gets allocated for something small) or time overheads (because THP sometimes needs to defragment memory to allocate a page). The good part is that there is a middle ground: madvise(2) lets the application tell Linux where to use THP (see the quick check below).
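As a quick check, the current THP mode can be read from sysfs; the value in brackets is the active one (the output shown here is illustrative):
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never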
Why the nomenclature uses "large" and "huge" interchangeably is beyond me. Anyway, OpenJDK supports both modes:
$ java -XX:+PrintFlagsFinal 2>&1 | grep Huge
bool UseHugeTLBFS = false {product} {default}
bool UseTransparentHugePages = false {product} {default}
$ java -XX:+PrintFlagsFinal 2>&1 | grep LargePage
bool UseLargePages = false {pd product} {default}
-XX:+UseHugeTLBFS mmaps the Java heap into hugetlbfs, which should be prepared separately.
-XX:+UseTransparentHugePages just madvise-s that the Java heap should use THP. This is a convenient option, because we know that the Java heap is large, mostly contiguous, and probably benefits from large pages the most.
-XX:+UseLargePages is a generic shortcut that enables anything available. On Linux, it enables hugetlbfs, not THP. I guess that is for historical reasons, because hugetlbfs came first.
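To see what a given combination actually selects on a particular machine, one can re-run the flag query from above with the shortcut enabled (the exact ergonomics may differ between JDK versions, so treat this as a sanity check rather than a specification):
$ java -XX:+UseLargePages -XX:+PrintFlagsFinal 2>&1 | grep -E "UseLargePages|UseHugeTLBFS|UseTransparentHugePages"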
Some applications do suffer with large pages enabled. (It is sometimes funny to see people do manual memory management to avoid GCs, only to have THP defrag cause latency spikes for them!) My gut feeling is that THP mostly regresses short-lived applications, where defrag costs are noticeable compared to the short application lifetime.
Experiment
Can we show the benefit large pages give us? Of course we can. Let's take a workload that any systems performance engineer has run at least once by their mid-thirties: allocate and randomly touch a byte[] array:
import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class ByteArrayTouch {
    @Param(...)
    int size;

    byte[] mem;

    @Setup
    public void setup() {
        mem = new byte[size];
    }

    @Benchmark
    public byte test() {
        // Touch a single byte at a random index.
        return mem[ThreadLocalRandom.current().nextInt(size)];
    }
}
(full source here)
We know that, depending on the size, the performance would be dominated by L1 cache misses, then L2 cache misses, then L3 cache misses, and so on. What this picture usually omits is the cost of TLB misses.
Before we run the test, we need to decide how much heap to take. On my machine, L3 is about 8M, so a 100M array would be enough to get past it. That means pessimistically allocating a 1G heap with -Xmx1G -Xms1G would be enough. This also gives us a guideline for how much to allocate for hugetlbfs.
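As a sanity check on the sizing: a 1G heap backed by 2M pages needs 1G / 2M = 512 huge pages, so reserving 1000 of them (as below) leaves comfortable headroom.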
So, let's make sure these options are set:
# HugeTLBFS should allocate 1000*2M pages:
sudo sysctl -w vm.nr_hugepages=1000
# THP to "madvise" only (some distros have an opinion about defaults):
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
I like to do "madvise" for THP, because it lets me to "opt-in" for particular parts of memory we know would benefit.
Running on i7 4790K, Linux x86_64, JDK 8u101:
Benchmark (size) Mode Cnt Score Error Units
# Baseline
ByteArrayTouch.test 1000 avgt 15 8.109 ± 0.018 ns/op
ByteArrayTouch.test 10000 avgt 15 8.086 ± 0.045 ns/op
ByteArrayTouch.test 1000000 avgt 15 9.831 ± 0.139 ns/op
ByteArrayTouch.test 10000000 avgt 15 19.734 ± 0.379 ns/op
ByteArrayTouch.test 100000000 avgt 15 32.538 ± 0.662 ns/op
# -XX:+UseTransparentHugePages
ByteArrayTouch.test 1000 avgt 15 8.104 ± 0.012 ns/op
ByteArrayTouch.test 10000 avgt 15 8.060 ± 0.005 ns/op
ByteArrayTouch.test 1000000 avgt 15 9.193 ± 0.086 ns/op // !
ByteArrayTouch.test 10000000 avgt 15 17.282 ± 0.405 ns/op // !!
ByteArrayTouch.test 100000000 avgt 15 28.698 ± 0.120 ns/op // !!!
# -XX:+UseHugeTLBFS
ByteArrayTouch.test 1000 avgt 15 8.104 ± 0.015 ns/op
ByteArrayTouch.test 10000 avgt 15 8.062 ± 0.011 ns/op
ByteArrayTouch.test 1000000 avgt 15 9.303 ± 0.133 ns/op // !
ByteArrayTouch.test 10000000 avgt 15 17.357 ± 0.217 ns/op // !!
ByteArrayTouch.test 100000000 avgt 15 28.697 ± 0.291 ns/op // !!!
A few observations here:
- On smaller sizes, both cache and TLB are fine, and there is no difference against the baseline.
- On larger sizes, cache misses start to dominate; this is why the costs grow in every configuration.
- On larger sizes, TLB misses come into the picture, and enabling large pages helps a lot!
- Both UseTransparentHugePages and UseHugeTLBFS help equally, because they provide the same service to the application.
To verify the TLB miss hypothesis, we can look at the hardware counters. JMH's -prof perfnorm gives them normalized per operation.
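Attaching the profiler is just an extra option on the same run (same assumptions as before about the benchmarks.jar packaging; perf needs to be available on the machine):
$ java -jar benchmarks.jar ByteArrayTouch -prof perfnorm -jvmArgs "-Xms1G -Xmx1G -XX:+UseTransparentHugePages"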
Benchmark (size) Mode Cnt Score Error Units
# Baseline
ByteArrayTouch.test 100000000 avgt 15 33.575 ± 2.161 ns/op
ByteArrayTouch.test:cycles 100000000 avgt 3 123.207 ± 73.725 #/op
ByteArrayTouch.test:dTLB-load-misses 100000000 avgt 3 1.017 ± 0.244 #/op // !!!
ByteArrayTouch.test:dTLB-loads 100000000 avgt 3 17.388 ± 1.195 #/op
# -XX:+UseTransparentHugePages
ByteArrayTouch.test 100000000 avgt 15 28.730 ± 0.124 ns/op
ByteArrayTouch.test:cycles 100000000 avgt 3 105.249 ± 6.232 #/op
ByteArrayTouch.test:dTLB-load-misses 100000000 avgt 3 ≈ 10⁻³ #/op
ByteArrayTouch.test:dTLB-loads 100000000 avgt 3 17.488 ± 1.278 #/op
There we go! One dTLB load miss per operation in the baseline, and almost none with THP enabled.
Of course, with THP defrag enabled, you will pay the upfront cost of defragmentation at allocation/access time. To shift these costs to JVM startup, and to avoid surprising latency hiccups while the application is running, you may instruct the JVM to touch every single page in the Java heap during initialization with -XX:+AlwaysPreTouch. It is a good idea to enable pre-touch for larger heaps anyway.
And here comes the funny part: enabling -XX:+UseTransparentHugePages actually makes -XX:+AlwaysPreTouch faster, because the OS now handles larger pages: there are fewer of them to handle, and there are more wins in streaming (zeroing) writes by the OS. Freeing memory after the process dies is also faster with THP, sometimes gruesomely so, until the parallel freeing patch trickles down to distro kernels.
Case in point, using a 4 TB (terabyte, with a T) heap:
$ time java -Xms4T -Xmx4T -XX:-UseTransparentHugePages -XX:+AlwaysPreTouch
real 13m58.167s # About 5 GB/sec
user 43m37.519s
sys 1011m25.740s
$ time java -Xms4T -Xmx4T -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
real 2m14.758s # About 31 GB/sec
user 1m56.488s
sys 73m59.046s
Committing and freeing 4 TB sure takes a while!
Observations
Large pages are an easy trick to boost application performance. Transparent Huge Pages in the Linux kernel make them more accessible. Transparent Huge Pages support in the JVM makes them easy to opt into. It is always a good idea to try large pages, especially if your application has lots of data and large heaps.