Please Test Your (Computer) Memory

Aleksey Shipilёv, JVM/Performance Geek
Shout out at Twitter: @shipilev
Questions, comments, suggestions: aleksey@shipilev.net

This post is also available in ePUB and mobi.

1. Introduction

To date, I have followed up on hundreds of supposed JVM GC crashes, and replied to many of them with "Please test your memory." In this short bit, I would try to explain why JVMs are notoriously good at… well… detecting bad memory. This would hopefully serve as the reference for many discussions I am having recurrently on the same topic.

I would not go into the gory details of how modern memory is organized, or how JVMs deal with memory, etc. I believe those are unnecessary distractions from the high-level arguments we can make here. These arguments also look obvious in hindsight!

For the purposes of this discussion, I would use the term "memory error" as the placeholder for the hardware-related memory problem like reading the bad value, and "bad memory cell" as the placeholder for any actual memory error location — although it could be a single cell, an entire row, a page, a bank, a module, etc.

2. Why JVMs Are Sensitive To Bad Memory

2.1. Reason 1: "Deployment Practices", or "Heap Sizing Policies"

JVMs are colloquially known to be memory hungry. But in many cases, that is a side effect of a heap sizing policy that prefers to expand the heap instead of trying more aggressive (often stop-the-world) GCs. In concurrent GCs, this gets easier, because GC does not stop the JVM all that much. But there are throughput tradeoffs anyway: keeping CPUs busy with GC instead of whatever the application is doing does have costs.

Many configurations choose to set large -Xmx (max heap size) to plan for extended capacity, accept allocation spikes, etc. Many other configurations additionally set large -Xms (initial/min heap size) to avoid heap resizing hiccups, which might involve going to the OS for memory and/or pre-zeroing it. Some configurations run with -XX:+AlwaysPreTouch to get the OS to commit all heap pages to physical memory right away.

When the JVM is the sole process running on a host, it is common to see -Xmx set close to the physical memory limits, minus native JVM overheads. Because apart from leaving the space for I/O caches, why keep that memory unused?

This has implications for memory errors. If a JVM process takes a lot of memory, it means its Java heap is mapped over most of the physical memory. Which means that if there is a bad memory cell somewhere, it has a very high chance of being used for the Java heap.
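
To put a rough number on that chance, here is a minimal sketch. It assumes a single bad cell that is equally likely to sit anywhere in physical memory, and a heap that is fully committed (for example, -Xms equal to -Xmx together with -XX:+AlwaysPreTouch). The class name, the method name and the 128 GiB host are made up for illustration.

```java
final class HeapCoverage {
    /**
     * Fraction of physical memory covered by the Java heap. Under the
     * (assumed) uniform placement of a single bad cell, this is also the
     * chance that the bad cell ends up backing the heap.
     */
    static double coverage(long heapBytes, long physicalBytes) {
        if (heapBytes < 0 || physicalBytes <= 0 || heapBytes > physicalBytes) {
            throw new IllegalArgumentException("expected 0 <= heap <= physical");
        }
        return (double) heapBytes / physicalBytes;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        // Hypothetical host: 128 GiB of RAM, JVM started with -Xmx120g.
        System.out.println("assumed host coverage: " + coverage(120 * gib, 128 * gib));
        // Max heap of this very JVM, as reported by the runtime.
        System.out.println("this JVM max heap, MiB: " + (Runtime.getRuntime().maxMemory() >> 20));
    }
}
```

With these assumed numbers the heap covers about 94% of physical memory, so a randomly placed bad cell would almost always end up under the Java heap.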

2.2. Reason 2: "System Data is Pervasive", or "Pointer/Metadata Density"

If you haven’t read anything about how the JVM represents objects, consider that the JVM keeps metadata with every object that describes its system state, including sizes, types, GC metadata, etc. Interested readers are advised to read "Java Objects Inside Out" for more discussion.

There are two notable statistical properties of common Java objects: they are usually small, and they are usually reference-rich. This translates to having a lot of object metadata overhead (mark and class words) and lots of pointers in memory (as the straight-forward representation of the Java reference).

To see how dense metadata/pointers are, let us consider the idiomatic example:

public class Person {
  String firstName;
  String lastName;
  public Person(String firstName, String lastName) ...
}

If we initialize it with something like new Person("Aleksey", "Shipilev"), then the memory representation of the entire ensemble of objects behind it would be something like:

   +0: meta,  Person.<mark-word>
   +8: meta,  Person.<class-word>
  +12: ref,   Person.firstName
  +16: ref,   Person.lastName
  +20: (padding)
  +24: meta,  String("Aleksey").<mark-word>
  +32: meta,  String("Aleksey").<class-word>
  +36: int,   String("Aleksey").hash
  +40: byte,  String("Aleksey").coder
  +41: bool,  String("Aleksey").hashIsZero
  +42: (padding)
  +44: ref,   String("Aleksey").value
  +48: meta,  byte[]("Aleksey").<mark-word>
  +56: meta,  byte[]("Aleksey").<class-word>
  +60: meta,  byte[]("Aleksey").<array-length>
  +64: byte*, byte[]("Aleksey").<contents>
  +71: (padding)
  +72: meta,  String("Shipilev").<mark-word>
  +80: meta,  String("Shipilev").<class-word>
  +84: int,   String("Shipilev").hash
  +88: byte,  String("Shipilev").coder
  +89: bool,  String("Shipilev").hashIsZero
  +90: (padding)
  +92: ref,   String("Shipilev").value
  +96: meta,  byte[]("Shipilev").<mark-word>
 +104: meta,  byte[]("Shipilev").<class-word>
 +108: meta,  byte[]("Shipilev").<array-length>
 +112: byte*, byte[]("Shipilev").<contents>
 +120: END.

Look at this contrived case: out of 120 bytes total, there are 68 bytes of metadata and 16 bytes of references. In total, there are 84 out of 120 bytes carrying the "system" data. And these "system" pieces are carried along with the rest of objects' "user" data, which means system data permeates the entire Java heap.
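
The tally can be re-derived from the listing. This is only the arithmetic written down as a sketch: the sizes are the ones shown in the dump above (8-byte mark word, 4-byte class word, 4-byte array length, 4-byte references), and LayoutTally is a hypothetical name.

```java
final class LayoutTally {
    // Field sizes as they appear in the listing above.
    static final int MARK_WORD = 8;
    static final int CLASS_WORD = 4;
    static final int ARRAY_LENGTH = 4;
    static final int REFERENCE = 4;
    static final int TOTAL_BYTES = 120;

    static int metadataBytes() {
        int objectHeader = MARK_WORD + CLASS_WORD;      // Person and both Strings
        int arrayHeader = objectHeader + ARRAY_LENGTH;  // both byte[] arrays
        return 3 * objectHeader + 2 * arrayHeader;
    }

    static int referenceBytes() {
        // Person.firstName, Person.lastName, and String.value twice.
        return 4 * REFERENCE;
    }

    static int systemBytes() {
        return metadataBytes() + referenceBytes();
    }

    public static void main(String[] args) {
        System.out.println("metadata:   " + metadataBytes());   // 68
        System.out.println("references: " + referenceBytes());  // 16
        System.out.println("system:     " + systemBytes() + " of " + TOTAL_BYTES
                + " (" + 100 * systemBytes() / TOTAL_BYTES + "%)"); // 84 of 120 (70%)
    }
}
```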

Again, this has implications for memory errors. A bit flip somewhere in user data (for example, in a large byte[] array) would probably go unnoticed unless it breaks specially checked application invariants, or the application logic adds its own verification.

A bit flip in system data breaks the JVM’s own invariants. A bit flip in object metadata breaks the Java heap integrity: change bits in markword and GC gets confused about object state; change bits in classword, and the whole runtime is confused about the type of the object; change bits in arraylength, and generated code, runtime, GCs access something beyond the array.

To our benefit, most of those failures are more or less easily detectable: corrupting the heap usually leads to GC crashes when it walks the heap, bad memory accesses usually access something at odd pointer values, outside the mapped memory (thus, "Segmentation Fault"), etc.
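
The difference between the two kinds of bit flips can be simulated on ordinary values. The sketch below does not touch real object headers; it only applies a single-bit XOR to a byte of "user" data and to a number standing in for an array length. BitFlip and flip are illustrative names.

```java
import java.nio.charset.StandardCharsets;

final class BitFlip {
    /** Returns the value with one bit inverted, as a single-bit memory upset would. */
    static int flip(int value, int bit) {
        return value ^ (1 << bit);
    }

    public static void main(String[] args) {
        // User data: one flipped bit in the first byte of "Aleksey".
        byte[] name = "Aleksey".getBytes(StandardCharsets.ISO_8859_1);
        name[0] = (byte) flip(name[0], 1);
        System.out.println(new String(name, StandardCharsets.ISO_8859_1)); // Cleksey

        // System data: one flipped bit in a stand-in for the array length.
        int arrayLength = 7;
        System.out.println(flip(arrayLength, 30)); // 1073741831
    }
}
```

The first flip yields a wrong but perfectly valid string that nothing would complain about. The second turns a 7-element array into one that claims to span a gigabyte, and whoever walks it next is very likely to run off the mapped memory.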

Combined with the previous observation: not only do you have a high chance of a bad memory cell backing the Java heap, there is also a high chance that it holds the system bits of data, and that the failure breaks the program in a more or less observable way.

My usual "Well, actually" goes here. At this point, many might think "OK, the worst thing that might happen to a program is a crash". Well, actually, that is the second best thing that might happen to a program. The failure happened, and then it was detected, and the program was disallowed to continue. The real worst case scenario is not noticing this problem, corrupting the data, persisting that corruption to the data storages, gradually overwriting all backups of non-corrupted data with the corrupted bits, doing that to many terabytes of sensitive data. Good luck untangling that mess.

2.3. Reason 3: "System Data is Visited Often", or "Tracing GCs Have A Touchy/Feely Side"

Why would the bit-flips in metadata/pointers be readily observable?

Sure, the application itself uses the data, for example it goes through the references every so often. But you might be lucky to never visit the broken object with the application code — for example if your application has lots of "cold" data, or it is under light load, etc.

But the tracing GCs themselves do visit a lot of objects. In our toy example above, these are the locations that a usual tracing GC would touch:

   +0: meta,  Person.<mark-word>                ; <--- used for marking
   +8: meta,  Person.<class-word>               ; <--- used for ref iteration
  +12: ref,   Person.firstName                  ; <--- visited for marking
  +16: ref,   Person.lastName                   ; <--- visited for marking
  +20: (padding)
  +24: meta,  String("Aleksey").<mark-word>     ; <--- used for marking
  +32: meta,  String("Aleksey").<class-word>    ; <--- used for ref iteration
  +36: int,   String("Aleksey").hash
  +40: byte,  String("Aleksey").coder
  +41: bool,  String("Aleksey").hashIsZero
  +42: (padding)
  +44: ref,   String("Aleksey").value           ; <--- visited for marking
  +48: meta,  byte[]("Aleksey").<mark-word>     ; <--- used for marking
  +56: meta,  byte[]("Aleksey").<class-word>    ; <--- used for array iteration
  +60: meta,  byte[]("Aleksey").<array-length>
  +64: byte*, byte[]("Aleksey").<contents>
  +71: (padding)
  +72: meta,  String("Shipilev").<mark-word>    ; <--- used for marking
  +80: meta,  String("Shipilev").<class-word>   ; <--- used for ref iteration
  +84: int,   String("Shipilev").hash
  +88: byte,  String("Shipilev").coder
  +89: bool,  String("Shipilev").hashIsZero
  +90: (padding)
  +92: ref,   String("Shipilev").value          ; <--- visited for marking
  +96: meta,  byte[]("Shipilev").<mark-word>    ; <--- used for marking
 +104: meta,  byte[]("Shipilev").<class-word>   ; <--- used for array iteration
 +108: meta,  byte[]("Shipilev").<array-length>
 +112: byte*, byte[]("Shipilev").<contents>
 +120: END.

Outgoing references are routinely touched during the traversal of the object graph: we need to know which other objects the current object points to. Class words would be used to figure out the object sizes for either heap walks (due to heap parsability) or to see how much data to copy (for evacuation). Mark words would be used to see if the object has a forwarded copy already.

If any of those pieces of system data are bad, that is the (lucky!) opportunity for GC to crash.

There are two mitigating factors:

Tracing GCs visit the live objects only. Which means the memory errors in dead objects would go unnoticed. Having a lot of live objects in the heap puts this mitigating factor out of the picture. This is why bug reports often say something like: "The JVM had crashed once we started piling on more data on the heap."
Generational GCs do not visit/copy many objects most of the time. Which means the memory errors in "old generation" would go unnoticed until Full GC happens. This is why bug reports often say something like: "It was all good until JVM decided to do Full GC and crashed."
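
The first mitigating factor can be seen in a toy model of marking. The sketch below is not how any HotSpot collector is implemented: Node, mark and demo are hypothetical names, and the "heap" is just a handful of plain Java objects with a mark flag and outgoing references.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ToyMarker {
    /** Stand-in for a heap object: a mark bit plus outgoing references. */
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        boolean marked;
        Node(String name) { this.name = name; }
    }

    /** Marks everything reachable from the roots; returns names in visit order. */
    static List<String> mark(List<Node> roots) {
        List<String> visited = new ArrayList<>();
        Deque<Node> worklist = new ArrayDeque<>();
        for (Node root : roots) {
            if (!root.marked) { root.marked = true; worklist.push(root); }
        }
        while (!worklist.isEmpty()) {
            Node current = worklist.pop();
            visited.add(current.name);          // "touches" the object's system data
            for (Node ref : current.refs) {     // follows every outgoing reference
                if (!ref.marked) { ref.marked = true; worklist.push(ref); }
            }
        }
        return visited;
    }

    static List<String> demo() {
        Node person = new Node("Person");
        Node first = new Node("String(Aleksey)");
        Node bytes = new Node("byte[](Aleksey)");
        Node dead = new Node("unreachable");    // nothing points to it
        person.refs.add(first);
        first.refs.add(bytes);
        dead.refs.add(first);
        return mark(List.of(person));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [Person, String(Aleksey), byte[](Aleksey)]
    }
}
```

The unreachable node is never read by the marking loop, so a corrupted bit inside it would stay invisible to this toy collector, while every live node gets its mark bit and references inspected.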

2.4. Perfect Storm

So we frequently have a "perfect storm":

Bad memory cell suddenly manifests.
Java heap spans most of the memory, including the bad memory cell.
JVM keeps pervasive metadata/pointer data in Java heap, that happens to reside on the bad memory cell.
JVM GC visits many (all) metadata/pointers, reading a bad value off the bad memory cell.

BAM! GC crash.

And since GC code is usually small but visits large swaths of physical memory, you would see roughly the same crash, even if the memory errors are actually random. The only difference would be subtle changes in the failing memory addresses.

3. Why Bad Memory Often Looks Like A JVM Crash

It is somewhat amusing that JVM’s own crash handlers shift the attention from the underlying OS/HW problems to the JVM itself. When a C program fails on illegal memory access (which might be caused by a memory error), it would print something like:

$ cat test.c
int main(void) {
  *((int*)(0xDEADBEEF)) = 0; // Accessing memory via broken ptr
}

$ gcc test.c -o test

$ ./test
Segmentation fault (core dumped)

In modern JVMs, trying to do the same trick yields the much richer diagnostics:

$ cat Crash.java
import java.lang.reflect.*;
import sun.misc.Unsafe;

public class Crash {
  public static void main(String... args) throws Exception {
    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Unsafe u = (Unsafe) f.get(null);
    u.getInt(0xDEADBEEF); // Accessing memory via broken ptr
  }
}

$ javac Crash.java
$ java Crash
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f53bbf9b1ea, pid=635939, tid=635940
#
# JRE version: OpenJDK Runtime Environment (16.0) (fastdebug build 16-internal+0-adhoc.shade.jdk)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 16-internal+0-adhoc.shade.jdk, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x19131ea]  Unsafe_GetInt+0x41a
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/shade/trunks/jdk/core.635939)
#
# An error report file with more information is saved as:
# /home/shade/trunks/jdk/hs_err_pid635939.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)

Technically it says the error was detected by the JVM; it does not say it was caused by the JVM. Nevertheless, it invites the user to submit a bug to the JVM bugtracker, which many users happily do.

The hs_err_pid*.log file would even point somewhere into the JVM, except for the last few frames:

---------------  S U M M A R Y ------------

Command Line: Crash

Host: shade-desktop, AMD Ryzen Threadripper 3970X 32-Core Processor, 64 cores, 125G, Ubuntu 20.04.1 LTS
Time: Sun Sep 27 11:37:21 2020 CEST elapsed time: 0.119082 seconds (0d 0h 0m 0s)

---------------  T H R E A D  ---------------

Current thread (0x00007f53b4027800):  JavaThread "main" [_thread_in_vm, id=635940, stack(0x00007f53ba438000,0x00007f53ba539000)]

Stack: [0x00007f53ba438000,0x00007f53ba539000],  sp=0x00007f53ba537660,  free space=1021k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x19131ea]  Unsafe_GetInt+0x41a
j  jdk.internal.misc.Unsafe.getInt(Ljava/lang/Object;J)I+0 java.base@16-internal
j  jdk.internal.misc.Unsafe.getInt(J)I+3 java.base@16-internal
j  sun.misc.Unsafe.getInt(J)I+4 jdk.unsupported@16-internal
j  Crash.main([Ljava/lang/String;)V+24
v  ~StubRoutines::call_stub
V  [libjvm.so+0xd6c4ea]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+0x62a
V  [libjvm.so+0xe98605]  jni_invoke_static(JNIEnv_*, JavaValue*, _jobject*, JNICallType, _jmethodID*, JNI_ArgumentPusher*, Thread*) [clone .isra.0] [clone .constprop.1]+0x3e5
V  [libjvm.so+0xe9e091]  jni_CallStaticVoidMethod+0x211
C  [libjli.so+0x523e]  JavaMain+0xc2e
C  [libjli.so+0x80bd]  ThreadJavaMain+0xd

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  jdk.internal.misc.Unsafe.getInt(Ljava/lang/Object;J)I+0 java.base@16-internal
j  jdk.internal.misc.Unsafe.getInt(J)I+3 java.base@16-internal
j  sun.misc.Unsafe.getInt(J)I+4 jdk.unsupported@16-internal
j  Crash.main([Ljava/lang/String;)V+24
v  ~StubRoutines::call_stub

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0xffffffffdeadbeef

Register to memory mapping:

RAX=0x00007f53b4027800 is a thread
RBX=0x00007f53bc6bacfa: <offset 0x0000000002032cfa> in /home/shade/trunks/jdk/build/linux-x86_64-server-fastdebug/images/jdk/lib/server/libjvm.so at 0x00007f53ba688000
RCX=0x0 is NULL
RDX=0x0 is NULL
RSP=0x00007f53ba537660 is pointing into the stack for thread: 0x00007f53b4027800
RBP=0x00007f53ba537740 is pointing into the stack for thread: 0x00007f53b4027800
RSI=0x0 is NULL
RDI=0x00007f53ba537618 is pointing into the stack for thread: 0x00007f53b4027800
R8 =0x000055ea37ad6000 points into unknown readable memory: 0x0000000000000002 | 02 00 00 00 00 00 00 00
R9 =0x000055ea37ad6000 points into unknown readable memory: 0x0000000000000002 | 02 00 00 00 00 00 00 00
R10=0x00007f539c4dfe4a is at code_begin+1546 in an Interpreter codelet
method entry point (kind = native)  [0x00007f539c4df840, 0x00007f539c4e0700]  3776 bytes
R11=0x0000000000000008 is an unknown value
R12=0x00007f53b4027800 is a thread
R13=0x00007f53ba537697 is pointing into the stack for thread: 0x00007f53b4027800
R14=0x0 is NULL
R15=0x00007f53ba5376d0 is pointing into the stack for thread: 0x00007f53b4027800
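
One detail in the log may look puzzling: si_addr reads 0xffffffffdeadbeef, not 0xdeadbeef. In Java, 0xDEADBEEF is an int literal with the top bit set, so it is negative, and it gets sign-extended when it is passed to a method that takes a long, as Unsafe.getInt(long) does. A minimal sketch that reproduces just the widening (SignExtension and widen are illustrative names):

```java
final class SignExtension {
    /** Widens an int to long the way Java does implicitly at a call site taking long. */
    static long widen(int value) {
        return value;
    }

    public static void main(String[] args) {
        // int literal with the top bit set: sign-extended on widening.
        System.out.println(Long.toHexString(widen(0xDEADBEEF))); // ffffffffdeadbeef
        // With the L suffix the literal is already a long: no sign extension.
        System.out.println(Long.toHexString(0xDEADBEEFL));       // deadbeef
    }
}
```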

Even though it is a plain error caused by the application, hs_err still helpfully prints out a lot of JVM diagnostics. For the uninitiated user, that definitely points to a bug in the JVM. But that is only an illusion caused by rich post-mortem diagnostics.

The cause might as well be a hardware fault, and a user process like the JVM has no straightforward way to see that. In fact, it takes a while to see whether an hs_err does look like a memory error. For all intents and purposes, it might be a JVM bug, because it would look roughly the same.

Even the most experienced JVM developers I know discounted memory errors as the reason for JVM failures they were seeing, spending countless hours chasing the non-existent software bug.

4. What To Do

Now that we know where the trouble might be, what can we do about it? The short answer is: test, test, test.

But before that, let me make a few pleas:

Please do not overclock the hardware if you want reliability. Running the hardware outside the specification negates the post-fabrication validations done already. You might be lucky, but in the reliability game, being lucky is not enough.
Please use the memory from your vendor's specifications. Many mainboard manufacturers publish "Memory QVL" lists that mention the memory modules that were tested with a given mainboard. The vendor already did the testing there, so use that testing!
Please re-test regularly. Hardware degrades over time: I disqualified a few memory modules over the years of use that passed the tests fine at the beginning! The conditions change: you tested in winter months, and then memory errors show up when ambients hit +30C. Unintended hardware problems arise: e.g. your kid knocked down your laptop and memory chips got a bad connection…

4.1. Deploy ECC RAM

As an industry, we know that memory errors are common. An old, but widely cited paper ballparks the incidence rate at 25 events per Gbit per year. If we boldly assume uniformity, then on a 128 GB machine (1024 Gbit), that amounts to about 25,600 events per year, or about 3 events per hour!
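
The back-of-the-envelope estimate can be written down as a small sketch. It assumes exactly what the paragraph assumes: a uniform rate across all memory and across time, with 128 GB taken as 128 x 8 = 1024 Gbit. ErrorRate and eventsPerHour are hypothetical names.

```java
import java.util.Locale;

final class ErrorRate {
    static final double HOURS_PER_YEAR = 365 * 24;

    /** Expected events per hour, assuming a uniform rate across all memory. */
    static double eventsPerHour(double memoryGigabytes, double eventsPerGbitPerYear) {
        double gigabits = memoryGigabytes * 8;
        return gigabits * eventsPerGbitPerYear / HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        // 1024 Gbit * 25 events = 25600 events per year, spread over 8760 hours.
        System.out.println(String.format(Locale.ROOT, "%.2f events/hour",
                eventsPerHour(128, 25))); // 2.92 events/hour
    }
}
```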

In fact, we do regularly protect L1/L2 caches with ECC:

$ sudo dmidecode
...
Handle 0x000A, DMI type 7, 27 bytes
Cache Information
        Socket Designation: L1 - Cache
        Configuration: Enabled, Not Socketed, Level 1
        Operational Mode: Write Back
        Installed Size: 2048 kB
        Error Correction Type: Multi-bit ECC

Handle 0x000B, DMI type 7, 27 bytes
Cache Information
        Socket Designation: L2 - Cache
        Configuration: Enabled, Not Socketed, Level 2
        Operational Mode: Write Back
        Installed Size: 16384 kB
        Error Correction Type: Multi-bit ECC

For years, ECC for main memory was strictly in the realm of enterprise servers. But today, you can get decent ECC support for desktop platforms as well, and sometimes you can even see vendors qualifying their desktop systems for ECC use. These days, I never even consider going for high-density DIMMs on any important system without going ECC. All my "production" home machines run with ECC modules.

Since ECC is online most of the time, it can help against soft errors, which are temporary upsets that do not indicate a hardware failure (usual example: background radiation upsets). These soft errors are very hard to reliably reproduce due to their physical nature. The continuous memory tests catch the hard errors more or less reliably: once something is broken in hardware, it has a much higher chance to replicate well. Catching the soft errors with continuous memory tests might be problematic.

Running memory scrubbing on ECC systems is also recommended, since error correction codes are usually SECDED (Single Error Correction, Double Error Detection). Which means, if you have a single-bit upset in cold data somewhere, you want ECC to detect it very early, while it is still a single-bit error. The risk of an adjacent memory error might be low, as these failures are rarely correlated, but better safe than sorry.
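
To see why a second error in the same word matters, here is a toy SECDED code. It is an extended Hamming (8,4) code picked for brevity, not the code any particular memory controller uses; real ECC memory typically protects much wider words. Secded, encode and decode are illustrative names.

```java
final class Secded {
    private static final int[] DATA_POSITIONS = {3, 5, 6, 7};

    /** XOR of the positions (1..7) of all set bits; zero for a valid codeword. */
    static int syndrome(int word) {
        int s = 0;
        for (int pos = 1; pos <= 7; pos++) {
            if (((word >> pos) & 1) != 0) s ^= pos;
        }
        return s;
    }

    /** Encodes 4 data bits into an 8-bit codeword (bit i of the int is position i). */
    static int encode(int nibble) {
        int word = 0;
        for (int i = 0; i < 4; i++) {
            if (((nibble >> i) & 1) != 0) word |= 1 << DATA_POSITIONS[i];
        }
        int s = syndrome(word);
        for (int p = 1; p <= 4; p <<= 1) {               // check bits at positions 1, 2, 4
            if ((s & p) != 0) word |= 1 << p;
        }
        if (Integer.bitCount(word) % 2 != 0) word |= 1;  // overall parity at position 0
        return word;
    }

    /** Returns the 4 data bits, or -1 if a double-bit error is detected. */
    static int decode(int word) {
        int s = syndrome(word);
        boolean oddParity = Integer.bitCount(word & 0xFF) % 2 != 0;
        if (oddParity) {
            word ^= 1 << s;   // single-bit error at position s (0 is the parity bit itself)
        } else if (s != 0) {
            return -1;        // parity is even but the syndrome is not zero: two bits flipped
        }
        int nibble = 0;
        for (int i = 0; i < 4; i++) {
            nibble |= ((word >> DATA_POSITIONS[i]) & 1) << i;
        }
        return nibble;
    }

    public static void main(String[] args) {
        int word = encode(0b1011);
        System.out.println(decode(word));                       // 11: clean read
        System.out.println(decode(word ^ (1 << 5)));            // 11: one flip, corrected
        System.out.println(decode(word ^ (1 << 5) ^ (1 << 6))); // -1: two flips, detected only
    }
}
```

A single flipped bit is repaired transparently, but once a second bit in the same word flips, the best this code can do is report the damage. Scrubbing rewrites the corrected word while the error count is still one.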

4.2. Run memtester(8)

While memtest86+ is the go-to tool for testing memory, I like memtester(8) much better. Mostly because it does not require taking the machine completely offline while testing is performed. This is very convenient for lab servers that do not have an easily attachable physical/virtual console. It is nice to go to a lab server, shut down all services like build agents, and then run memtester under a screen, until it completes or you get bored.

You would need a trial-and-error session to figure out how much memory the test can take and still leave the system responsive. On my 128G graphical desktop, 120G seems to fit. On my 128G bare lab server, 125G can be tested.

The whole thing takes many hours, depending on memory size, memory speed, etc. This round took about 6 hours on my desktop:

$ sudo memtester 120g
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 122880MB (128849018880 bytes)
got  122880MB (128849018880 bytes), trying mlock ...locked.
Loop 1:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok
  8-bit Writes        : ok
  16-bit Writes       : ok

I personally do this after any substantial hardware change or every 3 months, whichever comes first.

As noted before, these tests are not very useful to catch soft errors, but they do catch hard errors reasonably well.

4.3. Run memtest86+

The gold standard for memory testing is memtest86+. The thing I like about running it separately is multi-threading support. This might uncover more bugs. In practice, most of the faulty memory modules I had were failing in the single threaded mode too.

Every vendor I bought the memory from accepted the RMA with a short description like "memtest86+ is failing on this module". This seems to be a lingua franca in the testing world: if the module does not pass memtest86+, that is a real problem. In very rare cases the module itself is not a problem: mainboard is flaky, PSU is flaky, CPU is bad, etc.

memtest86+ seems to be a part of every serious pre-deployment checklist. I think it should be a part of every suspicious crash report checklist too, or at least it should be run periodically to catch the memory errors early without running a long memtest after any crash.

5. Conclusion

There is an adage I like:

"Everybody has a testing environment. Some people are lucky enough to have a totally separate environment to run production in."

In a similar vein:

"Everyone runs memory tests all the time. Some people are lucky enough to run the targeted memory tests that are separate from the production logic."

If you see JVM crashes, especially during GCs, then suspect everything, including hardware.

In the perfect world, I would love to stop replying to the GC bug reports with:

Hi,

From the error log, I would first suspect a hardware memory error.

Please test your memory.

Thanks,
-Aleksey