About, Disclaimers, Contacts
"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, and has not been reviewed for errors, consistency, writing style, syntactic or semantic mistakes, or duplicates. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out on Twitter: @shipilev; questions, comments, suggestions: aleksey@shipilev.net
Question
I look into JVM-generated machine code, and see weird XMM register usages on x86, when my code has no floating-point or vector operations at all. What is up with that?
Theory
FPU and vector units are ubiquitous in modern CPUs, and in many cases they come with their own, alternative sets of registers for FPU- and vector-specific operations. For example, the SSE and AVX extensions on Intel x86_64 add a set of wide XMM, YMM and ZMM registers that can be used in conjunction with the wider instructions.
While the scalar instruction set is usually not orthogonal with respect to vector and general-purpose registers (for example, we cannot use the general-purpose IMUL with an XMM register operand on x86_64), these registers still provide an interesting storage option: we can temporarily park data there, even if that data is never used for vector operations.[1]
Enter register allocation. The register allocator's duty is to take the program representation, with all the operands the program needs in a particular compilation unit (a method, for example), and map these virtual operands onto actual machine registers: to allocate registers for them. In many real programs, the number of virtual operands live at a given program location is greater than the number of machine registers available. At that point, the register allocator has to put some operands somewhere other than registers, for example on the stack; that is, it has to spill the operands.
Now, we have 16 general-purpose registers on x86_64 (not all of them usable), and 16 more XMM (SSE/AVX) registers on most modern machines. Can we spill to XMM registers instead of the stack? Yes, we can! Does it bring any benefit?
Practice
Consider this simple JMH benchmark. We construct the benchmark in a very special way (for simplicity, assume Java has pre-processing capabilities):
import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class FPUSpills {

    int s00, s01, s02, s03, s04, s05, s06, s07, s08, s09;
    int s10, s11, s12, s13, s14, s15, s16, s17, s18, s19;
    int s20, s21, s22, s23, s24;

    int d00, d01, d02, d03, d04, d05, d06, d07, d08, d09;
    int d10, d11, d12, d13, d14, d15, d16, d17, d18, d19;
    int d20, d21, d22, d23, d24;

    int sg;
    volatile int vsg;

    int dg;

    @Benchmark
#ifdef ORDERED
    public void ordered() {
#else
    public void unordered() {
#endif
        int v00 = s00; int v01 = s01; int v02 = s02; int v03 = s03; int v04 = s04;
        int v05 = s05; int v06 = s06; int v07 = s07; int v08 = s08; int v09 = s09;
        int v10 = s10; int v11 = s11; int v12 = s12; int v13 = s13; int v14 = s14;
        int v15 = s15; int v16 = s16; int v17 = s17; int v18 = s18; int v19 = s19;
        int v20 = s20; int v21 = s21; int v22 = s22; int v23 = s23; int v24 = s24;

#ifdef ORDERED
        dg = vsg; // Confuse optimizer a little
#else
        dg = sg;  // Just a plain store...
#endif

        d00 = v00; d01 = v01; d02 = v02; d03 = v03; d04 = v04;
        d05 = v05; d06 = v06; d07 = v07; d08 = v08; d09 = v09;
        d10 = v10; d11 = v11; d12 = v12; d13 = v13; d14 = v14;
        d15 = v15; d16 = v16; d17 = v17; d18 = v18; d19 = v19;
        d20 = v20; d21 = v21; d22 = v22; d23 = v23; d24 = v24;
    }
}
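For completeness, here is a minimal runner sketch for reproducing these measurements. It is an assumption on my part that you build the class above into a regular JMH benchmark jar, and the class name FPUSpillsRunner is purely illustrative. The "perfnorm" profiler produces the normalized hardware counters (CPI, L1-dcache-loads/stores, cycles, instructions) quoted below, and "perfasm" produces the annotated assembly listings:

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class FPUSpillsRunner {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(FPUSpills.class.getSimpleName())
                .addProfiler("perfnorm")   // normalized hardware counters: cycles, instructions, CPI, L1-dcache-*
             // .addProfiler("perfasm")    // annotated hot assembly, as in the listings below
                .build();
        new Runner(opt).run();
    }
}

The same can be done from the command line with the usual JMH options, e.g. -prof perfnorm or -prof perfasm on the benchmark jar.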
It reads and writes multiple pairs of fields. The optimizer is not actually bound by this particular program order, and indeed, that is what we observe in the unordered test:
Benchmark Mode Cnt Score Error Units
FPUSpills.unordered avgt 15 6.961 ± 0.002 ns/op
FPUSpills.unordered:CPI avgt 3 0.458 ± 0.024 #/op
FPUSpills.unordered:L1-dcache-loads avgt 3 28.057 ± 0.730 #/op
FPUSpills.unordered:L1-dcache-stores avgt 3 26.082 ± 1.235 #/op
FPUSpills.unordered:cycles avgt 3 26.165 ± 1.575 #/op
FPUSpills.unordered:instructions avgt 3 57.099 ± 0.971 #/op
This gives us around 26 load-store pairs, which roughly corresponds to the 25 pairs we have in the test. But we do not have 25 general-purpose registers! Perfasm reveals that the optimizer has scheduled each load right next to its matching store, so the register pressure stays much lower:
0.38% 0.28% ↗ movzbl 0x94(%rcx),%r9d
│ ...
0.25% 0.20% │ mov 0xc(%r11),%r10d ; getfield s00
0.04% 0.02% │ mov %r10d,0x70(%r8) ; putfield d00
│ ...
│ ... (transfer repeats for multiple vars) ...
│ ...
╰ je BACK
At this point, we want to cheat the optimizer a little and introduce a point of confusion, so that all the loads are performed well before the stores. That is what the ordered test does, and there we can see the loads and stores happening in bulk: first all the loads, then all the stores. Register pressure is highest at the point where all the loads have completed, but none of the stores have started yet. Even then, the difference against unordered is not dramatic:
Benchmark Mode Cnt Score Error Units
FPUSpills.unordered avgt 15 6.961 ± 0.002 ns/op
FPUSpills.unordered:CPI avgt 3 0.458 ± 0.024 #/op
FPUSpills.unordered:L1-dcache-loads avgt 3 28.057 ± 0.730 #/op
FPUSpills.unordered:L1-dcache-stores avgt 3 26.082 ± 1.235 #/op
FPUSpills.unordered:cycles avgt 3 26.165 ± 1.575 #/op
FPUSpills.unordered:instructions avgt 3 57.099 ± 0.971 #/op
FPUSpills.ordered avgt 15 7.961 ± 0.008 ns/op
FPUSpills.ordered:CPI avgt 3 0.329 ± 0.026 #/op
FPUSpills.ordered:L1-dcache-loads avgt 3 29.070 ± 1.361 #/op
FPUSpills.ordered:L1-dcache-stores avgt 3 26.131 ± 2.243 #/op
FPUSpills.ordered:cycles avgt 3 30.065 ± 0.821 #/op
FPUSpills.ordered:instructions avgt 3 91.449 ± 4.839 #/op
…and that is because we have managed to spill the operands to XMM registers, not to the stack:
3.08% 3.79% ↗ vmovq %xmm0,%r11
│ ...
0.25% 0.20% │ mov 0xc(%r11),%r10d ; getfield s00
0.02% │ vmovd %r10d,%xmm4 ; <--- FPU SPILL
0.25% 0.20% │ mov 0x10(%r11),%r10d ; getfield s01
0.02% │ vmovd %r10d,%xmm5 ; <--- FPU SPILL
│ ...
│ ... (more reads and spills to XMM registers) ...
│ ...
0.12% 0.02% │ mov 0x60(%r10),%r13d ; getfield s21
│ ...
│ ... (more reads into registers) ...
│ ...
│ ------- READS ARE FINISHED, WRITES START ------
0.18% 0.16% │ mov %r13d,0xc4(%rdi) ; putfield d21
│ ...
│ ... (more reads from registers and putfields)
│ ...
2.77% 3.10% │ vmovd %xmm5,%r11d ; <--- FPU UNSPILL
0.02% │ mov %r11d,0x78(%rdi) ; putfield d01
2.13% 2.34% │ vmovd %xmm4,%r11d ; <--- FPU UNSPILL
0.02% │ mov %r11d,0x70(%rdi) ; putfield d00
│ ...
│ ... (more unspills and putfields)
│ ...
╰ je BACK
Notice that we do use general-purpose registers (GPRs) for some operands, and spill only when they are depleted. The "when" is ill-defined here: in the listing we appear to spill first and use the GPRs later, but that is a false appearance, because the register allocator may operate over the complete graph rather than in program order.[2]
The latency of XMM spills seems minimal: even though we spend more instructions on the spills, they execute very efficiently and fill the pipelining gaps. With 34 additional instructions, which means around 17 spill pairs, we have claimed only 4 additional cycles. Note that it would be incorrect to account for the spills at 4/34 = ~0.12 clk/insn: that would imply roughly 9 instructions per cycle, more than current CPUs are capable of sustaining. But the marginal improvement is real, because we use execution units that were sitting idle before.
The claims of efficiency mean nothing if we have nothing to compare against. But here, we do! We can instruct Hotspot to avoid FPU spills with -XX:-UseFPUForSpilling, which gives us an idea of how much we win with XMM spills:
Benchmark Mode Cnt Score Error Units
# Default
FPUSpills.ordered avgt 15 7.961 ± 0.008 ns/op
FPUSpills.ordered:CPI avgt 3 0.329 ± 0.026 #/op
FPUSpills.ordered:L1-dcache-loads avgt 3 29.070 ± 1.361 #/op
FPUSpills.ordered:L1-dcache-stores avgt 3 26.131 ± 2.243 #/op
FPUSpills.ordered:cycles avgt 3 30.065 ± 0.821 #/op
FPUSpills.ordered:instructions avgt 3 91.449 ± 4.839 #/op
# -XX:-UseFPUForSpilling
FPUSpills.ordered avgt 15 10.976 ± 0.003 ns/op
FPUSpills.ordered:CPI avgt 3 0.455 ± 0.053 #/op
FPUSpills.ordered:L1-dcache-loads avgt 3 47.327 ± 5.113 #/op
FPUSpills.ordered:L1-dcache-stores avgt 3 41.078 ± 1.887 #/op
FPUSpills.ordered:cycles avgt 3 41.553 ± 2.641 #/op
FPUSpills.ordered:instructions avgt 3 91.264 ± 7.312 #/op
Oh, see the increased load/store counters per operation? Those are the stack spills: the stack itself, while fast, still resides in memory, and so accesses to it land in the L1 cache. It is roughly the same 17 additional spill pairs, but now they take about 11 additional cycles. The L1 cache throughput is the limiting factor here.
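For reference, here is a sketch of how the flag comparison could be scripted with JMH (again, the runner class name is illustrative; the only addition over the earlier sketch is that the forked JVMs get -XX:-UseFPUForSpilling appended to their arguments):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class FPUSpillsNoXmmRunner {
    public static void main(String[] args) throws RunnerException {
        // Same benchmark and profiler as before, but the forked JVMs run with
        // FPU/XMM spilling disabled, yielding the second block of numbers above.
        new Runner(new OptionsBuilder()
                .include(FPUSpills.class.getSimpleName())
                .addProfiler("perfnorm")
                .jvmArgsAppend("-XX:-UseFPUForSpilling")
                .build()).run();
    }
}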
Finally, we can eyeball the perfasm output for -XX:-UseFPUForSpilling:
2.45% 1.21% ↗ mov 0x70(%rsp),%r11
│ ...
0.50% 0.31% │ mov 0xc(%r11),%r10d ; getfield s00
0.02% │ mov %r10d,0x10(%rsp) ; <--- stack spill!
2.04% 1.29% │ mov 0x10(%r11),%r10d ; getfield s01
│ mov %r10d,0x14(%rsp) ; <--- stack spill!
│ ...
│ ... (more reads and spills to stack) ...
│ ...
0.12% 0.19% │ mov 0x64(%r10),%ebp ; getfield s22
│ ...
│ ... (more reads into registers) ...
│ ...
│ ------- READS ARE FINISHED, WRITES START ------
3.47% 4.45% │ mov %ebp,0xc8(%rdi) ; putfield d22
│ ...
│ ... (more reads from registers and putfields)
│ ...
1.81% 2.68% │ mov 0x14(%rsp),%r10d ; <--- stack unspill
0.29% 0.13% │ mov %r10d,0x78(%rdi) ; putfield d01
2.10% 2.12% │ mov 0x10(%rsp),%r10d ; <--- stack unspill
│ mov %r10d,0x70(%rdi) ; putfield d00
│ ...
│ ... (more unspills and putfields)
│ ...
╰ je BACK
Yup, the stack spills are in roughly the same places where we had XMM spills before.
Observations
FPU spills are a nice trick for alleviating register pressure problems. While this does not increase the number of registers available for general operations, it does provide faster temporary storage for spills: when we need just a few additional spill slots, we can avoid taking a round trip to the L1-cache-backed stack.
This is sometimes the cause of interesting performance deviations: if FPU spills are not used on some critical path, we may see diminished performance. For example, introducing a slow-path GC barrier call that is assumed to trash the FPU registers may tell the compiler to fall back to the usual stack-based spills, without trying anything fancy.
In Hotspot, -XX:+UseFPUForSpilling is enabled by default for SSE-capable x86 platforms, ARMv7, and AArch64. So, this works for most of your programs, whether you know about this trick or not.
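If in doubt whether your platform uses XMM spills, the flag value can be queried from a running HotSpot JVM. A minimal sketch, with an illustrative class name (note that getVMOption throws IllegalArgumentException on a JVM that does not define the flag):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class CheckFPUSpillFlag {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Prints "true" where UseFPUForSpilling is enabled by default
        System.out.println(hotspot.getVMOption("UseFPUForSpilling").getValue());
    }
}

Alternatively, java -XX:+PrintFlagsFinal -version and a grep for UseFPUForSpilling shows the same thing.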