About

"JVM Anatomy Park" is an on-going mini-post series, where every post is slated to take 5-10 minutes to read. As such, each post goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here are anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic mistakes, or duplicates. Use and/or trust this at your own risk.

Aleksey Shipilёv, Performance Geek @ Red Hat OpenJDK Team
Shout out at Twitter: @shipilev
Questions, comments, suggestions: aleksey@shipilev.net

Question

I look into JVM-generated machine code, and see weird XMM register usages on x86, when my code has no floating-point or vector operations at all. What is up with that?

Theory

FPU and vector units are ubiquitous in modern CPUs, and in many cases they provide alternative sets of registers for FPU-specific operations. For example, the SSE and AVX extensions in Intel x86_64 add sets of wide XMM, YMM and ZMM registers that are used in conjunction with the wider instructions.

While the scalar instruction set is usually not orthogonal to the vector register set (for example, we cannot use the general-purpose IMUL with an XMM register on x86_64), the vector registers still provide an interesting storage option: we can temporarily park data there, even if that data is never used in vector operations.[1]
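The key property that makes this work is that register-to-register moves like vmovd copy raw bits, losslessly. As a hedged Java-level analogy (this is only an illustration of the bit-preserving round trip, not what the JIT actually emits), we can park an integer's bits in a floating-point value and take them back unchanged:

```java
public class RawBitsPark {
    public static void main(String[] args) {
        int v = 0x1234ABCD;
        // "Spill": reinterpret the raw int bits as a float,
        // analogous to vmovd %r10d,%xmm4
        float parked = Float.intBitsToFloat(v);
        // "Unspill": take the raw bits back,
        // analogous to vmovd %xmm4,%r11d
        int restored = Float.floatToRawIntBits(parked);
        System.out.println(v == restored);
    }
}
```

(The chosen constant is not a NaN bit pattern, so the round trip is exact; raw-bit moves in hardware carry no such caveat at all.)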

Enter register allocation. The register allocator's duty is to take the program representation with all the operands the program needs in a particular compilation unit (a method, for example), and map these virtual operands to actual machine registers — to allocate registers for them. In many real programs, the number of live virtual operands at a given program location is greater than the number of machine registers available. At that point, the register allocator has to put some operands somewhere other than registers — e.g. on the stack — that is, spill the operands.
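As a toy sketch of that decision (hedged: HotSpot's allocators work on interference graphs and live ranges, not on this naive first-come-first-served scheme), assume the worst case where all virtual operands are live at once; anything beyond the machine register budget has to go to a spill slot:

```java
import java.util.*;

public class ToySpiller {
    // Map each virtual operand to either a machine register or a spill slot.
    // Hypothetical simplification: all operands are simultaneously live.
    static Map<String, String> allocate(List<String> operands, int machineRegs) {
        Map<String, String> assignment = new LinkedHashMap<>();
        int reg = 0, slot = 0;
        for (String op : operands) {
            if (reg < machineRegs) {
                assignment.put(op, "r" + (reg++));
            } else {
                assignment.put(op, "spill" + (slot++)); // registers depleted
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> ops = new ArrayList<>();
        for (int i = 0; i < 25; i++) ops.add("v" + i); // 25 live values
        // Assume ~13 GPRs are actually usable for operands (hypothetical budget)
        Map<String, String> a = allocate(ops, 13);
        long spills = a.values().stream()
                       .filter(s -> s.startsWith("spill")).count();
        System.out.println(spills); // operands that did not fit
    }
}
```

Whether a spill slot lives on the stack or in an XMM register is exactly the choice this post explores.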

Now, we have 16 general-purpose registers on x86_64 (not all of them usable for operands), and 16 more XMM registers on most modern machines. Can we spill to XMM registers instead of the stack? Yes, we can! Does it bring any benefit?

Practice

Consider this simple JMH benchmark. We construct the benchmark in a very special way (pretend, for simplicity, that Java has pre-processing capabilities):

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class FPUSpills {

    int s00, s01, s02, s03, s04, s05, s06, s07, s08, s09;
    int s10, s11, s12, s13, s14, s15, s16, s17, s18, s19;
    int s20, s21, s22, s23, s24;

    int d00, d01, d02, d03, d04, d05, d06, d07, d08, d09;
    int d10, d11, d12, d13, d14, d15, d16, d17, d18, d19;
    int d20, d21, d22, d23, d24;

    int sg;
    volatile int vsg;

    int dg;

    @Benchmark
#ifdef ORDERED
    public void ordered() {
#else
    public void unordered() {
#endif
        int v00 = s00; int v01 = s01; int v02 = s02; int v03 = s03; int v04 = s04;
        int v05 = s05; int v06 = s06; int v07 = s07; int v08 = s08; int v09 = s09;
        int v10 = s10; int v11 = s11; int v12 = s12; int v13 = s13; int v14 = s14;
        int v15 = s15; int v16 = s16; int v17 = s17; int v18 = s18; int v19 = s19;
        int v20 = s20; int v21 = s21; int v22 = s22; int v23 = s23; int v24 = s24;
#ifdef ORDERED
        dg = vsg; // Confuse optimizer a little
#else
        dg = sg;  // Just a plain store...
#endif
        d00 = v00; d01 = v01; d02 = v02; d03 = v03; d04 = v04;
        d05 = v05; d06 = v06; d07 = v07; d08 = v08; d09 = v09;
        d10 = v10; d11 = v11; d12 = v12; d13 = v13; d14 = v14;
        d15 = v15; d16 = v16; d17 = v17; d18 = v18; d19 = v19;
        d20 = v20; d21 = v21; d22 = v22; d23 = v23; d24 = v24;
    }
}

It reads and writes multiple pairs of fields at once. The optimizer is not actually tied to this particular program order. Indeed, that is what we observe in the unordered test:

Benchmark                                  Mode  Cnt   Score    Error  Units

FPUSpills.unordered                        avgt   15   6.961 ±  0.002  ns/op
FPUSpills.unordered:CPI                    avgt    3   0.458 ±  0.024   #/op
FPUSpills.unordered:L1-dcache-loads        avgt    3  28.057 ±  0.730   #/op
FPUSpills.unordered:L1-dcache-stores       avgt    3  26.082 ±  1.235   #/op
FPUSpills.unordered:cycles                 avgt    3  26.165 ±  1.575   #/op
FPUSpills.unordered:instructions           avgt    3  57.099 ±  0.971   #/op

This gives us around 26 load-store pairs, which corresponds roughly to the 25 pairs we have in the test. But we don't have 25 general-purpose registers! Perfasm reveals that the optimizer has merged the load-store pairs close to each other, so the register pressure is much lower:

  0.38%    0.28%  ↗  movzbl 0x94(%rcx),%r9d
                  │  ...
  0.25%    0.20%  │  mov    0xc(%r11),%r10d    ; getfield s00
  0.04%    0.02%  │  mov    %r10d,0x70(%r8)    ; putfield d00
                  │  ...
                  │  ... (transfer repeats for multiple vars) ...
                  │  ...
                  ╰  je     BACK

At this point, we want to cheat the optimizer a little and introduce a point of confusion, so that all loads are performed well before the stores. That is what the ordered test does, and there, we can see the loads and stores happening in bulk: first all the loads, then all the stores. The register pressure is highest at the point where all the loads have completed, but none of the stores have started yet. Even then, there is no dramatic difference against unordered:

Benchmark                                  Mode  Cnt   Score    Error  Units

FPUSpills.unordered                        avgt   15   6.961 ±  0.002  ns/op
FPUSpills.unordered:CPI                    avgt    3   0.458 ±  0.024   #/op
FPUSpills.unordered:L1-dcache-loads        avgt    3  28.057 ±  0.730   #/op
FPUSpills.unordered:L1-dcache-stores       avgt    3  26.082 ±  1.235   #/op
FPUSpills.unordered:cycles                 avgt    3  26.165 ±  1.575   #/op
FPUSpills.unordered:instructions           avgt    3  57.099 ±  0.971   #/op

FPUSpills.ordered                          avgt   15   7.961 ±  0.008  ns/op
FPUSpills.ordered:CPI                      avgt    3   0.329 ±  0.026   #/op
FPUSpills.ordered:L1-dcache-loads          avgt    3  29.070 ±  1.361   #/op
FPUSpills.ordered:L1-dcache-stores         avgt    3  26.131 ±  2.243   #/op
FPUSpills.ordered:cycles                   avgt    3  30.065 ±  0.821   #/op
FPUSpills.ordered:instructions             avgt    3  91.449 ±  4.839   #/op

…and that is because we have managed to spill the operands to XMM registers, not onto the stack:

  3.08%    3.79%  ↗  vmovq  %xmm0,%r11
                  │  ...
  0.25%    0.20%  │  mov    0xc(%r11),%r10d    ; getfield s00
  0.02%           │  vmovd  %r10d,%xmm4        ; <--- FPU SPILL
  0.25%    0.20%  │  mov    0x10(%r11),%r10d   ; getfield s01
  0.02%           │  vmovd  %r10d,%xmm5        ; <--- FPU SPILL
                  │  ...
                  │  ... (more reads and spills to XMM registers) ...
                  │  ...
  0.12%    0.02%  │  mov    0x60(%r10),%r13d   ; getfield s21
                  │  ...
                  │  ... (more reads into registers) ...
                  │  ...
                  │  ------- READS ARE FINISHED, WRITES START ------
  0.18%    0.16%  │  mov    %r13d,0xc4(%rdi)   ; putfield d21
                  │  ...
                  │  ... (more reads from registers and putfields)
                  │  ...
  2.77%    3.10%  │  vmovd  %xmm5,%r11d        ; <--- FPU UNSPILL
  0.02%           │  mov    %r11d,0x78(%rdi)   ; putfield d01
  2.13%    2.34%  │  vmovd  %xmm4,%r11d        ; <--- FPU UNSPILL
  0.02%           │  mov    %r11d,0x70(%rdi)   ; putfield d00
                  │  ...
                  │  ... (more unspills and putfields)
                  │  ...
                  ╰  je     BACK

Notice that we do use general-purpose registers (GPRs) for some operands, and only when they are depleted do we spill. In the listing, the spills seemingly come before the GPR uses, but that ordering is an artifact: register allocators may operate on the complete graph, not in program order.[2]

The latency of XMM spills seems minimal: even though we spend more instructions on the spills, they execute very efficiently and fill the pipelining gaps. With 34 additional instructions, which means around 17 spill pairs, we have claimed only 4 additional cycles. Note that it would be incorrect to read this as the spills executing at 4/34 = ~0.11 clk/insn — a throughput beyond what current CPUs can sustain in isolation. But the improvement is real, because we use execution units that were idle before.
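The arithmetic behind that claim can be checked against the counter deltas from the tables above (a back-of-envelope sketch; the numbers are the measured averages copied from the results):

```java
import java.util.Locale;

public class SpillCost {
    public static void main(String[] args) {
        // Per-op counter deltas between "ordered" and "unordered" runs
        double cyclesDelta = 30.065 - 26.165; // ~4 extra cycles
        double insnsDelta  = 91.449 - 57.099; // ~34 extra instructions
        // Incremental "CPI" of the added spill instructions alone
        System.out.printf(Locale.ROOT, "incremental CPI: %.2f clk/insn%n",
                cyclesDelta / insnsDelta);
    }
}
```

That ~0.11 clk/insn figure only makes sense because the spill moves overlap with the rest of the pipeline work, rather than executing back-to-back.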

Claims of efficiency mean nothing if we have nothing to compare against. But here, we do! We can instruct HotSpot to avoid FPU spills with -XX:-UseFPUForSpilling, which gives us an idea of how much we win with XMM spills:

Benchmark                                  Mode  Cnt   Score    Error  Units

# Default
FPUSpills.ordered                          avgt   15   7.961 ±  0.008  ns/op
FPUSpills.ordered:CPI                      avgt    3   0.329 ±  0.026   #/op
FPUSpills.ordered:L1-dcache-loads          avgt    3  29.070 ±  1.361   #/op
FPUSpills.ordered:L1-dcache-stores         avgt    3  26.131 ±  2.243   #/op
FPUSpills.ordered:cycles                   avgt    3  30.065 ±  0.821   #/op
FPUSpills.ordered:instructions             avgt    3  91.449 ±  4.839   #/op

# -XX:-UseFPUForSpilling
FPUSpills.ordered                          avgt   15  10.976 ±  0.003  ns/op
FPUSpills.ordered:CPI                      avgt    3   0.455 ±  0.053   #/op
FPUSpills.ordered:L1-dcache-loads          avgt    3  47.327 ±  5.113   #/op
FPUSpills.ordered:L1-dcache-stores         avgt    3  41.078 ±  1.887   #/op
FPUSpills.ordered:cycles                   avgt    3  41.553 ±  2.641   #/op
FPUSpills.ordered:instructions             avgt    3  91.264 ±  7.312   #/op

See the increased load/store counters per operation? These are the stack spills: the stack itself, while fast, still resides in memory, so accesses to the stack land in the L1 cache. It is roughly the same 17 additional spill pairs, but now they take ~11 additional cycles. The throughput of the L1 cache is the limiting factor here.
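The same back-of-envelope check works for the stack-spilling run, this time comparing the default run against -XX:-UseFPUForSpilling (numbers copied from the tables above):

```java
import java.util.Locale;

public class StackSpillCost {
    public static void main(String[] args) {
        // Per-op deltas: -XX:-UseFPUForSpilling minus default
        double cyclesDelta = 41.553 - 30.065; // extra cycles
        double loadsDelta  = 47.327 - 29.070; // extra L1 loads: unspills
        double storesDelta = 41.078 - 26.131; // extra L1 stores: spills
        System.out.printf(Locale.ROOT,
                "%.1f extra cycles, %.0f extra loads, %.0f extra stores%n",
                cyclesDelta, loadsDelta, storesDelta);
    }
}
```

Note the instruction counts of the two runs are about the same (~91 per op); only the destination of the spills changed, and that alone accounts for the extra cycles.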

Finally, we can eyeball the perfasm output for -XX:-UseFPUForSpilling:

  2.45%    1.21%  ↗  mov    0x70(%rsp),%r11
                  │  ...
  0.50%    0.31%  │  mov    0xc(%r11),%r10d    ; getfield s00
  0.02%           │  mov    %r10d,0x10(%rsp)   ; <--- stack spill!
  2.04%    1.29%  │  mov    0x10(%r11),%r10d   ; getfield s01
                  │  mov    %r10d,0x14(%rsp)   ; <--- stack spill!
                  │  ...
                  │  ... (more reads and spills to stack) ...
                  │  ...
  0.12%    0.19%  │  mov    0x64(%r10),%ebp    ; getfield s22
                  │  ...
                  │  ... (more reads into registers) ...
                  │  ...
                  │  ------- READS ARE FINISHED, WRITES START ------
  3.47%    4.45%  │  mov    %ebp,0xc8(%rdi)    ; putfield d22
                  │  ...
                  │  ... (more reads from registers and putfields)
                  │  ...
  1.81%    2.68%  │  mov    0x14(%rsp),%r10d   ; <--- stack unspill
  0.29%    0.13%  │  mov    %r10d,0x78(%rdi)   ; putfield d01
  2.10%    2.12%  │  mov    0x10(%rsp),%r10d   ; <--- stack unspill
                  │  mov    %r10d,0x70(%rdi)   ; putfield d00
                  │  ...
                  │  ... (more unspills and putfields)
                  │  ...
                  ╰  je     BACK

Yup, the stack spills are in similar places to where we had the XMM spills before.

Observations

FPU spills are a nice trick to alleviate register pressure problems. While they do not increase the number of registers available for general operations, they do provide faster temporary storage for spills: when we need just a few additional spill slots, we can avoid taking a trip to the L1-cache-backed stack for them.

This is sometimes the cause of interesting performance deviations: if FPU spills are not used on some critical path, we may see diminished performance. For example, introducing a slow-path GC barrier call that is assumed to trash the FPU registers may force the compiler to fall back to the usual stack-based spills, without trying anything fancy.

In HotSpot, -XX:+UseFPUForSpilling is enabled by default for SSE-capable x86 platforms, ARMv7, and AArch64. So this works for most of your programs, whether you know about the trick or not.


1. The extreme case of this technique is using vector registers as a line buffer!
2. Some register allocators perform linear allocation instead — improving register allocation speed at the expense of generated code efficiency.