About, Disclaimers, Contacts
"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic errors, duplicates, or also consistency. Use and/or trust this at your own risk.
Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net
Questions
Do JVMs perform profile-guided optimizations? Do JVMs profile branch frequency? Do JVMs use branch frequency information for anything?
Theory
Among other things that Java VMs bring to the ecosystem, one of the nice features is always-on profile-guided optimization. To understand how that works, consider that in a modern JVM, Java code is executed by several engines: the interpreter, the baseline compiler (in Hotspot, C1), and the optimizing compiler (in Hotspot, C2). As Java code becomes hotter and hotter in Hotspot, it gradually moves from the interpreter to C1, and then to C2. That means the earlier stages can profile the code online, and present that data to the later compilation stages.
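As a quick aside (my addition, not part of the experiment below), you can watch this progression on a trivial hot loop with the stock -XX:+PrintCompilation flag; a minimal sketch:

// Minimal sketch: a method that becomes hot enough to walk through the tiers.
// Run with: java -XX:+PrintCompilation TierWalk
// (the flag exists in HotSpot; the exact output format varies between JDK versions)
public class TierWalk {
    static int doCall(boolean condition) {
        return condition ? 1 : 2;
    }

    public static void main(String[] args) {
        int sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += doCall(i % 10 != 0); // taken ~90% of the time, like in the benchmark below
        }
        System.out.println(sum);
    }
}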
A notable piece of useful profiling data is the branch frequency counters. Branches are relatively infrequent in the code, and the counter itself usually only needs to record the number of times the branch was taken or skipped. This means the profiling data is not overwhelming to record.
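To make the bookkeeping concrete, here is an illustrative sketch of what a per-branch profile amounts to (my simplification, not actual HotSpot code, which keeps this data in its internal method data structures): two counters and a frequency derived from them.

// Illustration only: the idea behind per-branch profiling counters.
class BranchProfile {
    long taken;
    long notTaken;

    void record(boolean wasTaken) {
        if (wasTaken) taken++; else notTaken++;
    }

    double takenFrequency() {
        long total = taken + notTaken;
        return (total == 0) ? 0.5 : (double) taken / total; // no data yet: assume 50/50
    }
}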
Then, a smarter compiler can use the branch frequency data for many things, most notably to lay out the generated code so that the most often taken path is straight-line. Yes, CPU branch predictors can alleviate some of the pain even without frequency-based layout, but straight-line code is still better.
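Continuing the sketch above (again just an illustration of the idea, not HotSpot's actual block-layout heuristics), the layout decision boils down to putting the more frequent successor on the fall-through path:

// Illustration only: pick the fall-through successor by branch frequency.
// Real compilers do this as part of a global basic-block layout pass.
class BlockLayout {
    static String[] layout(double takenFrequency, String takenBlock, String notTakenBlock) {
        return (takenFrequency >= 0.5)
                ? new String[] { takenBlock, notTakenBlock }  // taken path falls through
                : new String[] { notTakenBlock, takenBlock }; // not-taken path falls through
    }
}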
Can we see this in practice?
Practice
Consider this JMH benchmark:[1]
import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class BranchFrequency {

    @Benchmark
    public void mostlyFalse() {
        for (int c = 0; c < 9; c++) {
            doCall(false);
        }
        doCall(true);
    }

    @Benchmark
    public void mostlyTrue() {
        for (int c = 0; c < 9; c++) {
            doCall(true);
        }
        doCall(false);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public int doCall(boolean condition) {
        if (condition) {
            return 1;
        } else {
            return 2;
        }
    }
}
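For completeness (this launcher is my addition, not shown in the original post; normally you would just run the generated benchmarks JAR with -prof perfnorm), here is one way to start a run through the JMH runner API, attaching the perfnorm profiler and the C1-only flag used in the first experiment below:

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BranchFrequencyRunner {
    public static void main(String[] args) throws Exception {
        Options opts = new OptionsBuilder()
                .include(BranchFrequency.class.getSimpleName())
                .addProfiler("perfnorm")                  // same as -prof perfnorm
                .jvmArgsAppend("-XX:TieredStopAtLevel=1") // C1-only run, as below
                .build();
        new Runner(opts).run();
    }
}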
In either test, the loop makes nine calls with one condition and then a single call with the other, so we take one of the branches 90% of the time.
The C1 compiler does not do frequency-based code layout, so we can expect the two tests to perform a bit differently. Indeed, they do! Look at the -prof perfnorm data with -XX:TieredStopAtLevel=1 on my i5-4210U:
Benchmark                                     Mode  Cnt    Score    Error  Units
BranchFrequency.mostlyFalse                   avgt   15   30.510 ±  0.212  ns/op
BranchFrequency.mostlyFalse:L1-dcache-loads   avgt    3   74.572 ± 13.337   #/op
BranchFrequency.mostlyFalse:L1-dcache-stores  avgt    3   40.366 ±  1.994   #/op
BranchFrequency.mostlyFalse:branch-misses     avgt    3    0.004 ±  0.042   #/op
BranchFrequency.mostlyFalse:branches          avgt    3   52.509 ±  7.670   #/op
BranchFrequency.mostlyFalse:cycles            avgt    3   82.643 ± 14.422   #/op
BranchFrequency.mostlyFalse:instructions      avgt    3  240.178 ± 32.612   #/op
BranchFrequency.mostlyTrue                    avgt   15   27.426 ±  0.066  ns/op
BranchFrequency.mostlyTrue:L1-dcache-loads    avgt    3   74.701 ±  6.478   #/op
BranchFrequency.mostlyTrue:L1-dcache-stores   avgt    3   40.105 ±  3.923   #/op
BranchFrequency.mostlyTrue:branch-misses      avgt    3    0.001 ±  0.002   #/op
BranchFrequency.mostlyTrue:branches           avgt    3   52.411 ±  4.591   #/op
BranchFrequency.mostlyTrue:cycles             avgt    3   73.837 ±  7.356   #/op
BranchFrequency.mostlyTrue:instructions       avgt    3  239.375 ± 27.798   #/op
We are running roughly the same number of instructions, L1 accesses, branches, etc. Yet the mostlyFalse case is slower by about 3 ns. If we look into the disassembly, then sure enough the code layout is static for the doCall method, with the true branch laid out first:
# C1, mostlyTrue
2.57% ...44c: cmp $0x0,%edx ; test $condition
╭ ...44f: je ...46d
2.71% │ ...455: mov $0x1,%eax ; if true, return 0x1
5.40% │ ...45a: add $0x30,%rsp
0.84% │ ...45e: pop %rbp
1.73% │ ...45f: cmp 0x340(%r15),%rsp
│ ...466: ja ...485
8.85% │ ...46c: retq
0.31% ↘ ...46d: mov $0x2,%eax ; if false, return 0x2
0.29% ...472: add $0x30,%rsp
0.72% ...476: pop %rbp
0.06% ...477: cmp 0x340(%r15),%rsp
...47e: ja ...49b
0.45% ...484: retq
# C1, mostlyFalse
2.76% ...74c: cmp $0x0,%edx ; test $condition
╭ ...74f: je ...76d
0.22% │ ...755: mov $0x1,%eax ; if true, return 0x1
0.06% │ ...75a: add $0x30,%rsp
0.96% │ ...75e: pop %rbp
0.06% │ ...75f: cmp 0x340(%r15),%rsp
│ ...766: ja ...785
0.20% │ ...76c: retq
2.46% ↘ ...76d: mov $0x2,%eax ; if false, return 0x2
7.01% ...772: add $0x30,%rsp
0.43% ...776: pop %rbp
0.98% ...777: cmp 0x340(%r15),%rsp
...77e: ja ...79b
8.86% ...784: retq
So in the mostlyTrue case, we fall through the branch and exit, while in the mostlyFalse case we have to take a jump and execute from another place.
If we run with C2 (either the default mode, or specifically -XX:-TieredCompilation), which does frequency-based layout, then we would see that the performance and counters are roughly the same in both cases:
Benchmark                                     Mode  Cnt    Score    Error  Units
BranchFrequency.mostlyFalse                   avgt   15   24.840 ±  0.027  ns/op
BranchFrequency.mostlyFalse:L1-dcache-loads   avgt    3   61.040 ±  2.702   #/op
BranchFrequency.mostlyFalse:L1-dcache-stores  avgt    3   38.022 ±  0.276   #/op
BranchFrequency.mostlyFalse:branch-misses     avgt    3    0.002 ±  0.036   #/op
BranchFrequency.mostlyFalse:branches          avgt    3   52.012 ±  4.265   #/op
BranchFrequency.mostlyFalse:cycles            avgt    3   66.616 ±  1.398   #/op
BranchFrequency.mostlyFalse:instructions      avgt    3  212.290 ± 13.588   #/op
BranchFrequency.mostlyTrue                    avgt   15   24.829 ±  0.043  ns/op
BranchFrequency.mostlyTrue:L1-dcache-loads    avgt    3   61.127 ±  4.135   #/op
BranchFrequency.mostlyTrue:L1-dcache-stores   avgt    3   38.072 ±  3.190   #/op
BranchFrequency.mostlyTrue:branch-misses      avgt    3    0.002 ±  0.027   #/op
BranchFrequency.mostlyTrue:branches           avgt    3   52.153 ±  4.914   #/op
BranchFrequency.mostlyTrue:cycles             avgt    3   66.664 ±  3.982   #/op
BranchFrequency.mostlyTrue:instructions       avgt    3  212.572 ± 10.962   #/op
The disassembly would reveal an interesting fact: in either case, the most frequent branch is laid out first!
# C2, mostlyTrue
1.49% ...cac: test %edx,%edx ; check $condition
╭ ...cae: je ...cc8
1.05% │ ...cb0: mov $0x1,%eax ; if true, return 0x1
1.45% │↗ ...cb5: add $0x10,%rsp
11.09% ││ ...cb9: pop %rbp
12.00% ││ ...cba: cmp 0x340(%r15),%rsp
││ ...cc1: ja ...ccf
1.55% ││ ...cc7: retq
0.04% ↘│ ...cc8: mov $0x2,%eax ; if false, return 0x2
╰ ...ccd: jmp ...cb5
# C2, mostlyFalse
4.85% ...32c: test %edx,%edx ; check $condition
╭ ...32e: jne ...348
1.10% │ ...330: mov $0x2,%eax ; if false, return 0x2
1.42% │↗ ...335: add $0x10,%rsp
8.95% ││ ...339: pop %rbp
11.77% ││ ...33a: cmp 0x340(%r15),%rsp
0.02% ││ ...341: ja ...34f
2.01% ││ ...347: retq
0.08% ↘│ ...348: mov $0x1,%eax ; if true, return 0x1
0.12% ╰ ...34d: jmp ...335
Here, the compiler capitalized handsomely on the available branch frequency data to get a uniform result.