JVM Anatomy Quark #28: Frequency-Based Code Layout

About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep for only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing 'tyle, syntaxtic and semantically errors, duplicates, or also consistency. Use and/or trust this at your own risk.

Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net

Questions

Do JVMs perform profile-guided optimizations? Do JVMs profile branch frequency? Do JVMs use branch frequency information for anything?

Theory

Among other things that Java VMs bring to the ecosystem, one of the good features is always on profile-guided optimization. To understand how that works, consider that in a modern JVM, Java code is executed with several engines: the interpreter, the baseline compiler (in Hotspot, C1), the optimized compiler (in Hotspot, C2). As Java code becomes more and more hot in Hotspot, it is moving gradually from interpreter to C1, then to C2. That means the prior stages can perform the online profiling of the code, and present that data to the higher stages in compilation.

A notable piece of the useful profiling data is the branch frequency counters. The branches are relatively non-frequent in the code, and the counter itself usually needs to just record the number of times the branch was taken/skipped. This means the profiling data is not overwhelming to record.

Then, a smarter compiler can use the branch frequency data to do many things, most notably, lay out the generated code so that most often taken path is straight-forward. Yes, CPU branch predictors can alleviate some of the pain without frequency-based layouts, but straight-forward code is still better.

Can we see this in practice?

Practice

Consider this JMH benchmark:^[1]

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class BranchFrequency {
    @Benchmark
    public void mostlyFalse() {
        for (int c = 0; c < 9; c++) {
            doCall(false);
        }
        doCall(true);
    }

    @Benchmark
    public void mostlyTrue() {
        for (int c = 0; c < 9; c++) {
            doCall(true);
        }
        doCall(false);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public int doCall(boolean condition) {
        if (condition) {
            return 1;
        } else {
            return 2;
        }
    }
}

In either test, we take either one or another branch 90% of the time.

C1 compiler does not do frequency-based code layout, so we can expect it perform a bit differently. Indeed, it does! Look at -prof perfnorm data with -XX:TieredStopAtLevel=1 on my i5-4210U:

Benchmark                                     Mode  Cnt    Score  Error   Units

BranchFrequency.mostlyFalse                   avgt   15   30.510 ±  0.212 ns/op
BranchFrequency.mostlyFalse:L1-dcache-loads   avgt    3   74.572 ± 13.337  #/op
BranchFrequency.mostlyFalse:L1-dcache-stores  avgt    3   40.366 ±  1.994  #/op
BranchFrequency.mostlyFalse:branch-misses     avgt    3    0.004 ±  0.042  #/op
BranchFrequency.mostlyFalse:branches          avgt    3   52.509 ±  7.670  #/op
BranchFrequency.mostlyFalse:cycles            avgt    3   82.643 ± 14.422  #/op
BranchFrequency.mostlyFalse:instructions      avgt    3  240.178 ± 32.612  #/op

BranchFrequency.mostlyTrue                    avgt   15   27.426 ±  0.066 ns/op
BranchFrequency.mostlyTrue:L1-dcache-loads    avgt    3   74.701 ±  6.478  #/op
BranchFrequency.mostlyTrue:L1-dcache-stores   avgt    3   40.105 ±  3.923  #/op
BranchFrequency.mostlyTrue:branch-misses      avgt    3    0.001 ±  0.002  #/op
BranchFrequency.mostlyTrue:branches           avgt    3   52.411 ±  4.591  #/op
BranchFrequency.mostlyTrue:cycles             avgt    3   73.837 ±  7.356  #/op
BranchFrequency.mostlyTrue:instructions       avgt    3  239.375 ± 27.798  #/op

We are running roughly the same amount of instructions, L1 accesses, branches, etc. Yet the mostlyFalse case is slower for about 3 ns. If we look into disasembly, then sure enough the code layout is static for doCall method, with true branch laid out first:

# C1, mostlyTrue
  2.57%   ...44c: cmp    $0x0,%edx          ; test $condition
        ╭ ...44f: je     ...46d
  2.71% │ ...455: mov    $0x1,%eax          ; if true, return 0x1
  5.40% │ ...45a: add    $0x30,%rsp
  0.84% │ ...45e: pop    %rbp
  1.73% │ ...45f: cmp    0x340(%r15),%rsp
        │ ...466: ja     ...485
  8.85% │ ...46c: retq
  0.31% ↘ ...46d: mov    $0x2,%eax          ; if false, return 0x2
  0.29%   ...472: add    $0x30,%rsp
  0.72%   ...476: pop    %rbp
  0.06%   ...477: cmp    0x340(%r15),%rsp
          ...47e: ja     ...49b
  0.45%   ...484: retq

# C1, mostlyFalse
  2.76%   ...74c: cmp    $0x0,%edx          ; test $condition
        ╭ ...74f: je     ...76d
  0.22% │ ...755: mov    $0x1,%eax          ; if true, return 0x1
  0.06% │ ...75a: add    $0x30,%rsp
  0.96% │ ...75e: pop    %rbp
  0.06% │ ...75f: cmp    0x340(%r15),%rsp
        │ ...766: ja     ...785
  0.20% │ ...76c: retq
  2.46% ↘ ...76d: mov    $0x2,%eax          ; if false, return 0x2
  7.01%   ...772: add    $0x30,%rsp
  0.43%   ...776: pop    %rbp
  0.98%   ...777: cmp    0x340(%r15),%rsp
          ...77e: ja     ...79b
  8.86%   ...784: retq

So in mostlyTrue case, we fall-through the branch and exit, while on mostlyFalse case we have to take a jump and execute from another place.

If we run with C2 (either default mode, or specifically -XX:-TieredCompilation), which does frequency-based layouts, then we would see the performance and counters are roughly the same in both cases:

Benchmark                                     Mode  Cnt    Score    Error Units

BranchFrequency.mostlyFalse                   avgt   15   24.840 ±  0.027 ns/op
BranchFrequency.mostlyFalse:L1-dcache-loads   avgt    3   61.040 ±  2.702  #/op
BranchFrequency.mostlyFalse:L1-dcache-stores  avgt    3   38.022 ±  0.276  #/op
BranchFrequency.mostlyFalse:branch-misses     avgt    3    0.002 ±  0.036  #/op
BranchFrequency.mostlyFalse:branches          avgt    3   52.012 ±  4.265  #/op
BranchFrequency.mostlyFalse:cycles            avgt    3   66.616 ±  1.398  #/op
BranchFrequency.mostlyFalse:instructions      avgt    3  212.290 ± 13.588  #/op

BranchFrequency.mostlyTrue                    avgt   15   24.829 ±  0.043 ns/op
BranchFrequency.mostlyTrue:L1-dcache-loads    avgt    3   61.127 ±  4.135  #/op
BranchFrequency.mostlyTrue:L1-dcache-stores   avgt    3   38.072 ±  3.190  #/op
BranchFrequency.mostlyTrue:branch-misses      avgt    3    0.002 ±  0.027  #/op
BranchFrequency.mostlyTrue:branches           avgt    3   52.153 ±  4.914  #/op
BranchFrequency.mostlyTrue:cycles             avgt    3   66.664 ±  3.982  #/op
BranchFrequency.mostlyTrue:instructions       avgt    3  212.572 ± 10.962  #/op

The disassembly would reveal an interesting fact: in either case, the most frequent branch is laid out first!

# C2, mostlyTrue
  1.49%    ...cac: test   %edx,%edx         ; check $condition
        ╭  ...cae: je     ...cc8
  1.05% │  ...cb0: mov    $0x1,%eax         ; if true, return 0x1
  1.45% │↗ ...cb5: add    $0x10,%rsp
 11.09% ││ ...cb9: pop    %rbp
 12.00% ││ ...cba: cmp    0x340(%r15),%rsp
        ││ ...cc1: ja     ...ccf
  1.55% ││ ...cc7: retq
  0.04% ↘│ ...cc8: mov    $0x2,%eax         ; if false, return 0x2
         ╰ ...ccd: jmp    ...cb5

# C2, mostlyFalse
  4.85%    ...32c: test   %edx,%edx         ; check $condition
        ╭  ...32e: jne    ...348
  1.10% │  ...330: mov    $0x2,%eax         ; if false, return 0x2
  1.42% │↗ ...335: add    $0x10,%rsp
  8.95% ││ ...339: pop    %rbp
 11.77% ││ ...33a: cmp    0x340(%r15),%rsp
  0.02% ││ ...341: ja     ...34f
  2.01% ││ ...347: retq
  0.08% ↘│ ...348: mov    $0x1,%eax         ; if true, return 0x1
  0.12%  ╰ ...34d: jmp    ...335

Here, the compiler capitalized handsomely on available branch frequency data to get the uniform result.

Conclusion

Always-on profiling in multi-tiered compilation schemes allows doing interesting profile-guided optimizations transparently to users.

1. Note how the benchmark forbids the inlining of the method. This both improves the generated code fidelity, as we can see the small method compiled in full, and it also fits a rather unconventional practice of forbidding the inlining of the method, instead of Blackhole-ing its result.