About, Disclaimers, Contacts
"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic mistakes, or duplicates. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net
Theory
Virtual memory is taken for granted now. Only a few now remember, let alone do, some "real mode" programming that exposes the actual physical memory. Instead, every process has its own virtual memory space, and that space is mapped onto actual memory. This allows, for instance, two processes to keep distinct data at the same virtual address 0x42424242, backed by different physical memory. Now, when a program accesses that address, something has to translate the virtual address to a physical one.
This is normally achieved by the OS maintaining the "page table", and the hardware doing the "page table walk" through that table to translate the address. The whole thing gets easier when translations are maintained at page granularity. But it is nevertheless not very cheap, and it needs to happen for every memory access! Therefore, there is also a small cache of the latest translations, the Translation Lookaside Buffer (TLB). The TLB is usually very small, below 100 entries, because it needs to be at least as fast as the L1 cache, if not faster. For many workloads, TLB misses and the associated page table walks take significant time.
Since we cannot make the TLB larger, we can do something else: make the pages larger! Most hardware has 4K basic pages, and 2M/4M/1G "large pages". Having larger pages cover the same region also makes the page tables themselves smaller, which makes the cost of the page table walk lower.
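To get a feel for the numbers, take a hypothetical 64-entry data TLB (the exact size is machine-specific): with 4K pages it covers only 64 × 4K = 256K of memory without a miss, while with 2M pages the same 64 entries cover 64 × 2M = 128M.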
In the Linux world, there are at least two distinct ways to get this in applications:
- hugetlbfs. Cut out a part of system memory, expose it as a virtual filesystem, and let applications mmap(2) from it. This is a peculiar interface that requires both OS configuration and application changes to use. It is also an "all or nothing" kind of deal: the space allocated for (the persistent part of) hugetlbfs cannot be used by regular processes.
- Transparent Huge Pages (THP). Let the application allocate memory as usual, but try to provide large-pages-backed storage transparently to the application. Ideally, no application changes are needed, but we will see how applications can benefit from knowing THP is available. In practice, though, there are either memory overheads (because an entire large page gets allocated for something small) or time overheads (because THP sometimes needs to defragment memory to allocate a page). The good part is that there is a middle ground: madvise(2) lets the application tell Linux where to use THP (see the quick check below).
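As a quick check, the current THP mode can be read from sysfs; the value in brackets is the active one (the output shown here is illustrative):
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never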
Why the nomenclature uses "large" and "huge" interchangeably is beyond me. Anyway, OpenJDK supports both modes:
$ java -XX:+PrintFlagsFinal 2>&1 | grep Huge
bool UseHugeTLBFS = false {product} {default}
bool UseTransparentHugePages = false {product} {default}
$ java -XX:+PrintFlagsFinal 2>&1 | grep LargePage
bool UseLargePages = false {pd product} {default}
-XX:+UseHugeTLBFS mmaps the Java heap into hugetlbfs, which should be prepared separately.
-XX:+UseTransparentHugePages just madvise-s that the Java heap should use THP. This is a convenient option, because we know that the Java heap is large, mostly contiguous, and probably benefits from large pages the most.
-XX:+UseLargePages is a generic shortcut that enables anything available. On Linux, it enables hugetlbfs, not THP. I guess that is for historical reasons, because hugetlbfs came first.
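To see what a given combination actually selects on a particular machine, one can re-run the flag query from above with the shortcut enabled (the exact ergonomics may differ between JDK versions, so treat this as a sanity check rather than a specification):
$ java -XX:+UseLargePages -XX:+PrintFlagsFinal 2>&1 | grep -E "UseLargePages|UseHugeTLBFS|UseTransparentHugePages"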
Some applications do suffer with large pages enabled. (It is sometimes funny to see people do manual memory management to avoid GCs, only to have THP defrag cause latency spikes for them!) My gut feeling is that THP mostly regresses short-lived applications, where defrag costs are noticeable compared to the short application lifetime.
Experiment
Can we show the benefit large pages give us? Of course we can. Let's take a workload that any systems performance engineer has run at least once by their mid-thirties: allocate and randomly touch a byte[] array:
import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class ByteArrayTouch {
    @Param(...)
    int size;

    byte[] mem;

    @Setup
    public void setup() {
        mem = new byte[size];
    }

    @Benchmark
    public byte test() {
        // Touch a single byte at a random index.
        return mem[ThreadLocalRandom.current().nextInt(size)];
    }
}
(full source here)
We know that, depending on the size, the performance would be dominated by L1 cache misses, then L2 cache misses, then L3 cache misses, and so on. What this picture usually omits is the cost of TLB misses.
Before we run the test, we need to decide how much heap to take. On my machine, L3 is about 8M, so a 100M array would be enough to get past it. That means pessimistically allocating a 1G heap with -Xmx1G -Xms1G would be enough. This also gives us a guideline for how much to allocate for hugetlbfs.
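As a sanity check on the sizing: a 1G heap backed by 2M pages needs 1G / 2M = 512 huge pages, so reserving 1000 of them (as below) leaves comfortable headroom.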
So, let's make sure these options are set:
# HugeTLBFS should allocate 1000*2M pages:
sudo sysctl -w vm.nr_hugepages=1000
# THP to "madvise" only (some distros have an opinion about defaults):
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
I like to do "madvise" for THP, because it lets me to "opt-in" for particular parts of memory we know would benefit.
Running on i7 4790K, Linux x86_64, JDK 8u101:
Benchmark (size) Mode Cnt Score Error Units
# Baseline
ByteArrayTouch.test 1000 avgt 15 8.109 ± 0.018 ns/op
ByteArrayTouch.test 10000 avgt 15 8.086 ± 0.045 ns/op
ByteArrayTouch.test 1000000 avgt 15 9.831 ± 0.139 ns/op
ByteArrayTouch.test 10000000 avgt 15 19.734 ± 0.379 ns/op
ByteArrayTouch.test 100000000 avgt 15 32.538 ± 0.662 ns/op
# -XX:+UseTransparentHugePages
ByteArrayTouch.test 1000 avgt 15 8.104 ± 0.012 ns/op
ByteArrayTouch.test 10000 avgt 15 8.060 ± 0.005 ns/op
ByteArrayTouch.test 1000000 avgt 15 9.193 ± 0.086 ns/op // !
ByteArrayTouch.test 10000000 avgt 15 17.282 ± 0.405 ns/op // !!
ByteArrayTouch.test 100000000 avgt 15 28.698 ± 0.120 ns/op // !!!
# -XX:+UseHugeTLBFS
ByteArrayTouch.test 1000 avgt 15 8.104 ± 0.015 ns/op
ByteArrayTouch.test 10000 avgt 15 8.062 ± 0.011 ns/op
ByteArrayTouch.test 1000000 avgt 15 9.303 ± 0.133 ns/op // !
ByteArrayTouch.test 10000000 avgt 15 17.357 ± 0.217 ns/op // !!
ByteArrayTouch.test 100000000 avgt 15 28.697 ± 0.291 ns/op // !!!
A few observations here:
- On smaller sizes, both cache and TLB are fine, and there is no difference against the baseline.
- On larger sizes, cache misses start to dominate; this is why the costs grow in every configuration.
- On larger sizes, TLB misses come into the picture, and enabling large pages helps a lot!
- Both UseTransparentHugePages and UseHugeTLBFS help equally, because they provide the same service to the application.
To verify the TLB miss hypothesis, we can look at the hardware counters. JMH's -prof perfnorm gives them normalized per operation.
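Attaching the profiler is just an extra option on the same run (same assumptions as before about the benchmarks.jar packaging; perf needs to be available on the machine):
$ java -jar benchmarks.jar ByteArrayTouch -prof perfnorm -jvmArgs "-Xms1G -Xmx1G -XX:+UseTransparentHugePages"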
Benchmark (size) Mode Cnt Score Error Units
# Baseline
ByteArrayTouch.test 100000000 avgt 15 33.575 ± 2.161 ns/op
ByteArrayTouch.test:cycles 100000000 avgt 3 123.207 ± 73.725 #/op
ByteArrayTouch.test:dTLB-load-misses 100000000 avgt 3 1.017 ± 0.244 #/op // !!!
ByteArrayTouch.test:dTLB-loads 100000000 avgt 3 17.388 ± 1.195 #/op
# -XX:+UseTransparentHugePages
ByteArrayTouch.test 100000000 avgt 15 28.730 ± 0.124 ns/op
ByteArrayTouch.test:cycles 100000000 avgt 3 105.249 ± 6.232 #/op
ByteArrayTouch.test:dTLB-load-misses 100000000 avgt 3 ≈ 10⁻³ #/op
ByteArrayTouch.test:dTLB-loads 100000000 avgt 3 17.488 ± 1.278 #/op
There we go! One dTLB load miss per operation in the baseline, and almost none with THP enabled.
Of course, with THP defrag enabled, you will pay the upfront cost of defragmentation at allocation/access time. To shift these costs to JVM startup, and to avoid surprising latency hiccups while the application is running, you may instruct the JVM to touch every single page in the Java heap during initialization with -XX:+AlwaysPreTouch. It is a good idea to enable pre-touch for larger heaps anyway.
And here comes the funny part: enabling -XX:+UseTransparentHugePages actually makes -XX:+AlwaysPreTouch faster, because the OS now handles larger pages: there are fewer of them to handle, and there are more wins in streaming (zeroing) writes by the OS. Freeing memory after the process dies is also faster with THP, sometimes gruesomely so, until the parallel freeing patch trickles down to distro kernels.
Case in point, using a 4 TB (terabyte, with a T) heap:
$ time java -Xms4T -Xmx4T -XX:-UseTransparentHugePages -XX:+AlwaysPreTouch
real 13m58.167s # About 5 GB/sec
user 43m37.519s
sys 1011m25.740s
$ time java -Xms4T -Xmx4T -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
real 2m14.758s # About 31 GB/sec
user 1m56.488s
sys 73m59.046s
Committing and freeing 4 TB sure takes a while!
Observations
Large pages are an easy trick to boost application performance. Transparent Huge Pages in the Linux kernel make them more accessible. Transparent Huge Pages support in the JVM makes them easy to opt into. It is always a good idea to try large pages, especially if your application has lots of data and large heaps.