JVM Anatomy Quark #12: Native Memory Tracking

About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep for only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing 'tyle, syntaxtic and semantically errors, duplicates, or also consistency. Use and/or trust this at your own risk.

Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net

Question

I have 512 MB of available memory, so I set -Xms512m -Xmx512m, and my VM fails with "not enough memory to proceed". Why?

Theory

The JVM is a native application, and it also needs memory to maintain its internal data structures that represent application code, generated machine code, heap metadata, class metadata, internal profiling, etc. This is not accounted for in the Java heap, because most of those things are native: allocated in the C heap, or mmap-ed to memory. The JVM also prepares lots of things expecting an active long-running application with a decent number of classes loaded, enough generated code created at runtime, etc. The defaults may be too high for short-lived applications in memory-constrained scenarios.

OpenJDK 8 onwards has a nice internal VM feature called "Native Memory Tracking" (NMT): it instruments all internal VM allocations and lets you categorize them, get an idea of where they are coming from, etc. This feature is invaluable for understanding what the VM uses memory for.

NMT can be enabled with -XX:NativeMemoryTracking=summary. You can get jcmd to dump the current NMT data (jcmd <pid> VM.native_memory summary), or you may request the data dump at JVM termination with -XX:+PrintNMTStatistics, which is a diagnostic option and therefore also needs -XX:+UnlockDiagnosticVMOptions. Saying -XX:NativeMemoryTracking=detail would additionally get the memory map for mmaps and call stacks for mallocs.
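The same dump can also be requested from inside the running process. The following is a minimal sketch, assuming a HotSpot-based JVM that exposes the DiagnosticCommand MBean; the class name NmtDump is made up for illustration, and "vmNativeMemory" is, as far as we can tell, the MBean operation that corresponds to the jcmd command VM.native_memory.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Sketch: ask the running JVM for its own NMT summary, the in-process
// counterpart of "jcmd <pid> VM.native_memory summary".
// NmtDump is a hypothetical helper name, not part of any JDK API.
final class NmtDump {
    // Returns the NMT report, or a short message if the VM cannot provide one.
    static String summary() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("com.sun.management:type=DiagnosticCommand");
            Object result = server.invoke(
                    name,
                    "vmNativeMemory",                            // assumed mapping of VM.native_memory
                    new Object[] { new String[] { "summary" } }, // same argument jcmd takes
                    new String[] { String[].class.getName() });
            return String.valueOf(result);
        } catch (Exception e) {
            // Not a HotSpot JVM, or the diagnostic MBean is unavailable.
            return "NMT data unavailable: " + e;
        }
    }

    public static void main(String... args) {
        // Start the JVM with -XX:NativeMemoryTracking=summary to get real data;
        // without it, the VM only reports that tracking is not enabled.
        System.out.println(summary());
    }
}
```

This keeps NMT collection inside the application, which may be handy where attaching jcmd from the outside is inconvenient.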

Most of the time, "summary" suffices for the overview. But we can also read the "detail"-ed log, see where allocations are coming from or for what, read the VM source code, and/or play with VM options to see what affects what. For example, take a simple "Hello World" application, like this:

public class Hello {
  public static void main(String... args) {
    System.out.println("Hello");
  }
}

It is obvious that the Java heap takes a significant part of the allocated memory, so let’s trim it with -Xmx16m -Xms16m, and see our baseline:

Native Memory Tracking:

Total: reserved=1373921KB, committed=74953KB
-                 Java Heap (reserved=16384KB, committed=16384KB)
                            (mmap: reserved=16384KB, committed=16384KB)

-                     Class (reserved=1066093KB, committed=14189KB)
                            (classes #391)
                            (malloc=9325KB #148)
                            (mmap: reserved=1056768KB, committed=4864KB)

-                    Thread (reserved=19614KB, committed=19614KB)
                            (thread #19)
                            (stack: reserved=19532KB, committed=19532KB)
                            (malloc=59KB #105)
                            (arena=22KB #38)

-                      Code (reserved=249632KB, committed=2568KB)
                            (malloc=32KB #297)
                            (mmap: reserved=249600KB, committed=2536KB)

-                        GC (reserved=10991KB, committed=10991KB)
                            (malloc=10383KB #129)
                            (mmap: reserved=608KB, committed=608KB)

-                  Compiler (reserved=132KB, committed=132KB)
                            (malloc=2KB #23)
                            (arena=131KB #3)

-                  Internal (reserved=9444KB, committed=9444KB)
                            (malloc=9412KB #1373)
                            (mmap: reserved=32KB, committed=32KB)

-                    Symbol (reserved=1356KB, committed=1356KB)
                            (malloc=900KB #65)
                            (arena=456KB #1)

-    Native Memory Tracking (reserved=38KB, committed=38KB)
                            (malloc=3KB #41)
                            (tracking overhead=35KB)

-               Arena Chunk (reserved=237KB, committed=237KB)
                            (malloc=237KB)

Okay. 75 MB for 16 MB Java heap is certainly unexpected.
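To see at a glance which categories dominate, the summary can be parsed mechanically. This is only a sketch: the regular expression assumes the exact line shape printed above (sizes in KB), and NmtSummary is a hypothetical helper, not a JDK class.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull the per-category "committed" numbers out of an NMT summary.
final class NmtSummary {
    // Matches lines like "-    Java Heap (reserved=16384KB, committed=16384KB)".
    private static final Pattern CATEGORY = Pattern.compile(
            "^-\\s+(.+?) \\(reserved=(\\d+)KB, committed=(\\d+)KB\\)\\s*$");

    // Category name -> committed KB, in the order the report lists them.
    static Map<String, Long> committedKb(String report) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (String line : report.split("\\R")) {
            Matcher m = CATEGORY.matcher(line);
            if (m.matches()) {
                result.put(m.group(1), Long.parseLong(m.group(3)));
            }
        }
        return result;
    }

    // A few lines copied from the baseline report above.
    static final String SAMPLE = String.join("\n",
            "Total: reserved=1373921KB, committed=74953KB",
            "-                 Java Heap (reserved=16384KB, committed=16384KB)",
            "                            (mmap: reserved=16384KB, committed=16384KB)",
            "-                    Thread (reserved=19614KB, committed=19614KB)",
            "                            (thread #19)",
            "-                        GC (reserved=10991KB, committed=10991KB)");

    public static void main(String... args) {
        // Print categories, largest committed size first.
        committedKb(SAMPLE).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .forEach(e -> System.out.printf("%-12s %6d KB%n", e.getKey(), e.getValue()));
    }
}
```

Sub-lines such as "(mmap: ...)" and the "Total:" line are deliberately skipped, so only the top-level categories end up in the map.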

Slimdown: Sane Parts

Let’s roll over different parts of that NMT output to see if those parts are tunable.

Start with something familiar:

-                        GC (reserved=10991KB, committed=10991KB)
                            (malloc=10383KB #129)
                            (mmap: reserved=608KB, committed=608KB)

This accounts for GC native structures. The log says GC malloc-ed around 10 MB and mmap-ed around 0.6 MB. One should expect this to grow with increasing heap size, if those structures describe something about the heap — for example, marking bitmaps, card tables, remembered sets, etc. Indeed it does so:

# Xms/Xmx = 512 MB
-                        GC (reserved=29543KB, committed=29543KB)
                            (malloc=10383KB #129)
                            (mmap: reserved=19160KB, committed=19160KB)

# Xms/Xmx = 4 GB
-                        GC (reserved=163627KB, committed=163627KB)
                            (malloc=10383KB #129)
                            (mmap: reserved=153244KB, committed=153244KB)

# Xms/Xmx = 16 GB
-                        GC (reserved=623339KB, committed=623339KB)
                            (malloc=10383KB #129)
                            (mmap: reserved=612956KB, committed=612956KB)

Quite probably the malloc-ed parts are the C heap allocations of task queues for parallel GC, and the mmap-ed regions are the bitmaps. Not surprisingly, they grow with heap size, and take around 3-4% of the configured heap size. This raises deployment questions, like in the original question: configuring the heap size to take all available physical memory will blow the memory limits, possibly swapping, possibly invoking the OOM killer.
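The 3-4% figure can be checked against the three logs above, and then turned into a rough heap-sizing estimate. This is a back-of-the-envelope sketch: maxHeapKb is a hypothetical model (budget = heap + fixed overhead + heap * GC ratio), and the fixed overhead plugged in is simply the non-heap committed memory from our baseline run, which is an assumption and will differ for a real application with more threads and classes.

```java
// Sketch: GC side-structure overhead as a share of the Java heap,
// using the mmap numbers from the three NMT logs above.
final class GcOverhead {
    // Share of the heap taken by the GC's mmap-ed structures, in percent.
    static double percentOfHeap(long gcMmapKb, long heapKb) {
        return 100.0 * gcMmapKb / heapKb;
    }

    // Hypothetical model: budget = heap + fixedOverhead + heap * gcRatio.
    // Solves for the largest heap that still fits into the budget.
    static long maxHeapKb(long budgetKb, long fixedOverheadKb, double gcRatio) {
        return (long) ((budgetKb - fixedOverheadKb) / (1.0 + gcRatio));
    }

    public static void main(String... args) {
        long[][] samples = {          // {heap KB, GC mmap KB}
            {512L * 1024, 19160},
            {4L * 1024 * 1024, 153244},
            {16L * 1024 * 1024, 612956},
        };
        for (long[] s : samples) {
            System.out.printf("heap=%d KB, GC mmap=%d KB, %.2f%% of heap%n",
                    s[0], s[1], percentOfHeap(s[1], s[0]));
        }
        // Assumption: 58569 KB fixed overhead = 74953 KB committed - 16384 KB heap (baseline run).
        long heap = maxHeapKb(512L * 1024, 58569, 0.0365);
        System.out.printf("rough max heap for a 512 MB budget: %d KB%n", heap);
    }
}
```

The point of the estimate is only that the usable heap is noticeably below the memory budget, which is exactly what the original question runs into.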

But that overhead is also dependent on the GC in use, because different GCs choose to represent Java heap differently. For example, switching back to the most lightweight GC in OpenJDK, -XX:+UseSerialGC, yields this dramatic change in our test case:

-Total: reserved=1374184KB, committed=75216KB
+Total: reserved=1336541KB, committed=37573KB

--                     Class (reserved=1066093KB, committed=14189KB)
+-                     Class (reserved=1056877KB, committed=4973KB)
                             (classes #391)
-                            (malloc=9325KB #148)
+                            (malloc=109KB #127)
                             (mmap: reserved=1056768KB, committed=4864KB)

--                    Thread (reserved=19614KB, committed=19614KB)
-                            (thread #19)
-                            (stack: reserved=19532KB, committed=19532KB)
-                            (malloc=59KB #105)
-                            (arena=22KB #38)
+-                    Thread (reserved=11357KB, committed=11357KB)
+                            (thread #11)
+                            (stack: reserved=11308KB, committed=11308KB)
+                            (malloc=36KB #57)
+                            (arena=13KB #22)

--                        GC (reserved=10991KB, committed=10991KB)
-                            (malloc=10383KB #129)
-                            (mmap: reserved=608KB, committed=608KB)
+-                        GC (reserved=67KB, committed=67KB)
+                            (malloc=7KB #79)
+                            (mmap: reserved=60KB, committed=60KB)

--                  Internal (reserved=9444KB, committed=9444KB)
-                            (malloc=9412KB #1373)
+-                  Internal (reserved=204KB, committed=204KB)
+                            (malloc=172KB #1229)
                             (mmap: reserved=32KB, committed=32KB)

Note this improved both the "GC" part, because less metadata is allocated, and the "Thread" part, because fewer GC threads are needed when switching from Parallel (default) to Serial GC. This means we can get a partial improvement by tuning down the number of GC threads for Parallel, G1, CMS, Shenandoah, etc. We’ll see about the thread stacks later. Note that changing the GC or the number of GC threads will have performance implications: by changing this, you are selecting another point in the space-time tradeoff.

It also improved "Class" parts, because metadata representation is slightly different. Can we squeeze out something from "Class"? Let us try Class Data Sharing (CDS), enabled with -Xshare:on:

-Total: reserved=1336279KB, committed=37311KB
+Total: reserved=1372715KB, committed=36763KB

--                    Symbol (reserved=1356KB, committed=1356KB)
-                            (malloc=900KB #65)
-                            (arena=456KB #1)
-
+-                    Symbol (reserved=503KB, committed=503KB)
+                            (malloc=502KB #12)
+                            (arena=1KB #1)

There we go, saved another 0.5 MB in internal symbol tables by loading the pre-parsed representation from the shared archive.

Now let’s focus on threads. The log would say:

-                    Thread (reserved=11357KB, committed=11357KB)
                            (thread #11)
                            (stack: reserved=11308KB, committed=11308KB)
                            (malloc=36KB #57)
                            (arena=13KB #22)

Looking into this, you can see that most of the space taken by threads is the thread stacks. You can try to trim the stack size down from the default (which appears to be 1M in this example) to something less with -Xss. Note this would yield a greater risk of StackOverflowError-s, so if you do change this option, be sure to test all possible configurations of your software to look out for ill effects. Adventurously setting this to 256 KB with -Xss256k yields:

-Total: reserved=1372715KB, committed=36763KB
+Total: reserved=1368842KB, committed=32890KB

--                    Thread (reserved=11357KB, committed=11357KB)
+-                    Thread (reserved=7517KB, committed=7517KB)
                             (thread #11)
-                            (stack: reserved=11308KB, committed=11308KB)
+                            (stack: reserved=7468KB, committed=7468KB)
                             (malloc=36KB #57)
                             (arena=13KB #22)

Not bad, another 4 MB is gone. Of course, the improvement would be more drastic with more application threads, and it will quite probably be the second largest consumer of memory after Java heap.
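One way to get a feeling for how much headroom a smaller stack leaves is to probe the reachable recursion depth. The sketch below uses the Thread constructor that takes a stack size instead of -Xss, so several sizes can be tried in one run; the JDK documents that argument as a hint the VM may round or ignore, and StackDepthProbe is a made-up name. The depths are only indicative, since real frames are larger than this trivial one.

```java
// Sketch: measure how deep a trivial recursion gets with a given stack size.
final class StackDepthProbe {
    // Runs a probe thread with the requested stack size (a hint to the VM)
    // and returns the number of frames reached before StackOverflowError.
    static int maxDepth(long stackSizeBytes) {
        final int[] depth = {0};
        Runnable probe = new Runnable() {
            void recurse() {
                depth[0]++;
                recurse();
            }

            @Override
            public void run() {
                try {
                    recurse();
                } catch (StackOverflowError expected) {
                    // Reached the end of this thread's stack; depth[0] holds the result.
                }
            }
        };
        Thread t = new Thread(null, probe, "stack-probe", stackSizeBytes);
        t.start();
        try {
            t.join(); // join() makes the probe's writes visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return depth[0];
    }

    public static void main(String... args) {
        for (long kb : new long[] {256, 512, 1024}) {
            System.out.printf("stack=%d KB -> depth %d%n", kb, maxDepth(kb * 1024));
        }
    }
}
```

If the application's deepest legitimate call chains come anywhere near the depth reported for the trimmed size, the smaller stack is probably not safe.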

Continuing on threading, the JIT compiler itself also has threads. This partially explains why we set the stack size to 256 KB, but the data above says the average stack size is still 7517 / 11 = 683 KB. Trimming the number of compiler threads down with -XX:CICompilerCount=1 and setting -XX:-TieredCompilation to enable only the latest compilation tier yields:

-Total: reserved=1368612KB, committed=32660KB
+Total: reserved=1165843KB, committed=29571KB

--                    Thread (reserved=7517KB, committed=7517KB)
-                            (thread #11)
-                            (stack: reserved=7468KB, committed=7468KB)
-                            (malloc=36KB #57)
-                            (arena=13KB #22)
+-                    Thread (reserved=4419KB, committed=4419KB)
+                            (thread #8)
+                            (stack: reserved=4384KB, committed=4384KB)
+                            (malloc=26KB #42)
+                            (arena=9KB #16)

Not bad, three threads are gone, and their stacks are gone too! Again, this has performance implications: fewer compiler threads means slower warmup.

Trimming down the Java heap size, selecting an appropriate GC, trimming down the number of VM threads, and trimming down the Java thread stack sizes and thread counts are the general techniques for reducing VM footprint in memory-constrained scenarios. With these, we have trimmed down our 16 MB Java heap test case to:

-Total: reserved=1373922KB, committed=74954KB
+Total: reserved=1165843KB, committed=29571KB
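For reference, the "sane" options used so far can be collected into one launch command. The sketch below builds that command for a child JVM; SlimLauncher is a hypothetical helper, the flag values are the ones from this post rather than recommendations, and -XX:+UnlockDiagnosticVMOptions is added because -XX:+PrintNMTStatistics is a diagnostic option.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: launch a main class in a child JVM with the "sane" slimdown options.
final class SlimLauncher {
    // Builds the command line; all values are the ones tried in this post.
    static List<String> command(String mainClass) {
        String java = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>(Arrays.asList(
                java,
                "-Xms16m", "-Xmx16m",                  // small fixed-size Java heap
                "-XX:+UseSerialGC",                    // lightest GC: fewer threads, less metadata
                "-Xshare:on",                          // Class Data Sharing
                "-Xss256k",                            // smaller Java thread stacks
                "-XX:CICompilerCount=1",               // single compiler thread
                "-XX:-TieredCompilation",
                "-XX:NativeMemoryTracking=summary",    // keep NMT on to verify the effect
                "-XX:+UnlockDiagnosticVMOptions",
                "-XX:+PrintNMTStatistics",
                "-cp", System.getProperty("java.class.path")));
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String... args) throws Exception {
        String mainClass = args.length > 0 ? args[0] : "Hello";
        Process p = new ProcessBuilder(command(mainClass)).inheritIO().start();
        System.exit(p.waitFor());
    }
}
```

Whether every flag is accepted depends on the JDK version and platform (for example, the minimum allowed stack size varies), so treat a startup failure as a hint to relax the corresponding setting.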

Slimdown: Insane Parts

What is suggested in this section is insane. Use this at your own risk. Do not try this at home.

Moving on to the insane parts, which involve tuning down internal VM settings. This is not guaranteed to work, and may crash and burn unexpectedly. For example, we can control the stack sizes required for our Java application by coding it carefully. But we don’t know what is going on inside the JVM itself, so trimming down the stack size for VM threads is dangerous. Still, it is hilarious to try with -XX:VMThreadStackSize=256:

-Total: reserved=1165843KB, committed=29571KB
+Total: reserved=1163539KB, committed=27267KB

--                    Thread (reserved=4419KB, committed=4419KB)
+-                    Thread (reserved=2115KB, committed=2115KB)
                             (thread #8)
-                            (stack: reserved=4384KB, committed=4384KB)
+                            (stack: reserved=2080KB, committed=2080KB)
                             (malloc=26KB #42)
                             (arena=9KB #16)

Ah yes, another 2 MB are gone along with compiler and GC thread stacks.

Let’s continue abusing the compiler code: why don’t we trim down the initial code cache size — the size of area for generated code? Enter -XX:InitialCodeCacheSize=4096 (bytes!):

-Total: reserved=1163539KB, committed=27267KB
+Total: reserved=1163506KB, committed=25226KB

--                      Code (reserved=49941KB, committed=2557KB)
+-                      Code (reserved=49941KB, committed=549KB)
                             (malloc=21KB #257)
-                            (mmap: reserved=49920KB, committed=2536KB)
+                            (mmap: reserved=49920KB, committed=528KB)

 -                        GC (reserved=67KB, committed=67KB)
                             (malloc=7KB #78)

Ho-ho! This will balloon up once we hit heavy compilation, but so far so good.

Looking closer at "Class" again, we can see that most of the 4 MB committed for our Hello World is the initial metadata storage size. We can trim it down with -XX:InitialBootClassLoaderMetaspaceSize=4096 (bytes!):

-Total: reserved=1163506KB, committed=25226KB
+Total: reserved=1157404KB, committed=21172KB

--                     Class (reserved=1056890KB, committed=4986KB)
+-                     Class (reserved=1050754KB, committed=898KB)
                             (classes #4)
-                            (malloc=122KB #83)
-                            (mmap: reserved=1056768KB, committed=4864KB)
+                            (malloc=122KB #84)
+                            (mmap: reserved=1050632KB, committed=776KB)

 -                    Thread (reserved=2115KB, committed=2115KB)
                             (thread #8)

Overall, after all this madness, we came even closer to 16 MB of Java heap size, wasting only about 4.7 MB on top of that (21172 KB committed minus 16384 KB of heap):

-Total: reserved=1165843KB, committed=29571KB
+Total: reserved=1157404KB, committed=21172KB

We can probably come even closer if we start yanking parts of JVM in our custom build.
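The whole slimdown can be tallied step by step. The sketch below just replays the committed totals from the "+Total" lines of the diffs in this post (the runs behind individual diffs differ by a few hundred KB, so the per-step savings are approximate); SlimdownSteps is a made-up name.

```java
// Sketch: committed totals after each tuning step, as reported in this post.
final class SlimdownSteps {
    static final long HEAP_KB = 16384; // -Xms16m -Xmx16m

    static final String[] STEPS = {
        "baseline (-Xms16m -Xmx16m)",
        "-XX:+UseSerialGC",
        "Class Data Sharing",
        "-Xss256k",
        "-XX:CICompilerCount=1 -XX:-TieredCompilation",
        "-XX:VMThreadStackSize=256",
        "-XX:InitialCodeCacheSize=4096",
        "-XX:InitialBootClassLoaderMetaspaceSize=4096",
    };

    static final long[] COMMITTED_KB = {
        74953, 37573, 36763, 32890, 29571, 27267, 25226, 21172,
    };

    // Native overhead on top of the Java heap.
    static long overheadKb(long committedKb) {
        return committedKb - HEAP_KB;
    }

    public static void main(String... args) {
        for (int i = 0; i < STEPS.length; i++) {
            long saved = (i == 0) ? 0 : COMMITTED_KB[i - 1] - COMMITTED_KB[i];
            System.out.printf("%-48s committed=%6d KB, saved=%6d KB, overhead=%6d KB%n",
                    STEPS[i], COMMITTED_KB[i], saved, overheadKb(COMMITTED_KB[i]));
        }
    }
}
```

Laid out like this, it is easy to see that the GC switch accounts for most of the savings, and that the insane options only shave off the last few megabytes.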

Putting All Together

For fun, we can see how the native overhead changes with heap size on our test workload:

This confirms our gained intuition that GC overheads are a constant fraction of the Java heap size, and native VM overheads start to matter only on lower heap sizes, when the absolute values for VM overhead start to become a factor in overall footprint. This picture omits the second most important thing though: thread stacks.

Observations

The default JVM configuration is usually tailored to long-running server-class applications, and so its initial guesses about the GCs, the initial sizes for internal data structures, stack sizes, etc. may not be appropriate for short-running memory-constrained applications. Understanding what the major memory hogs are in the current JVM configuration helps to cram more JVMs onto the host.

Using NMT to discover where the VM spends memory is usually an enlightening exercise. It almost immediately leads to insights about where to get memory footprint improvements for the particular application. Hooking up an online NMT monitor to performance management systems would help to adjust the JVM parameters when running actual production applications. This is much, much, much easier than trying to figure out what the JVM is doing by parsing the opaque memory maps from e.g. /proc/$pid/maps.

Also see "OpenJDK and Containers" by Christine Flood.