About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep for only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing style, syntactic and semantic errors, or duplicates. Use and/or trust this at your own risk.


Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net

Questions

Are there any interesting tricks that can be done when we have branch frequency data? What is a conditional move?

Theory

If you need to choose between two results coming from a branch, there are two distinct things you can do at the ISA level.

First, you can do a branch:

    # %r = (%rCond == 1) ? $v1 : $v2
    cmp %rCond, $1
    jne A
    mov %r, $v1
    jmp E
 A: mov %r, $v2
 E:

Second, you can perform a predicated instruction that is dependent on the result of the comparison. In x86, this takes the form of a conditional move (CMOV) that performs an action when a selected condition holds:

  # %r = (%rCond == 1) ? $v1 : $v2
  mov %r, $v1      ; put $v1 to %r
  cmp %rCond, $1   ; test the condition
  cmovne %r, $v2   ; put $v2 to %r if condition is false

The upside of doing a conditional move is that it sometimes generates more compact code, like in this example, and it does not suffer from a possible branch misprediction penalty. The downside is that it requires computing both sides before choosing which one would be returned, which can waste CPU cycles, increase register pressure, etc. In the branchy case, we can choose not to compute stuff after checking the condition. A well-predicted branch would then outperform the conditional move.

Therefore, the choice of whether or not to do a conditional move depends heavily on the predicted cost. This is where branch profiling helps us: it can tell which branches are probably not perfectly predicted, and are thus amenable to CMOV replacement. Of course, the actual cost model also includes the types of the arguments we are dealing with, the relative depth of both computation branches, etc.
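
For illustration, here is a hypothetical Java sketch (mine, not from this post) of a case where predication would be a poor choice: one side of the select is expensive, so unconditionally computing it just to feed a CMOV would waste work, and the branchy form wins regardless of branch profile. To my understanding, the JIT would not even consider converting a diamond that contains a call or a loop.

static int select(boolean cond, int[] data) {
    // Only compute the expensive side when it is actually needed;
    // predicating this would mean always paying for the whole loop.
    return cond ? 42 : expensiveSum(data);
}

static int expensiveSum(int[] data) {
    int s = 0;
    for (int v : data) s += v;
    return s;
}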

Can we see how this behaves in practice?

Practice 1

Consider this JMH benchmark:

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class BranchFrequency {
    @Benchmark
    public void fair() {
        doCall(true);
        doCall(false);
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public int doCall(boolean condition) {
        if (condition) {
            return 1;
        } else {
            return 2;
        }
    }
}

We flip-flop between the branches on every call, which means the runtime profile is roughly 50%-50% between them. If we inhibit the conditional move replacement by supplying -XX:ConditionalMoveLimit=0, then we can clearly see the difference the replacement makes.
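
For reference, here is a sketch of how one might obtain the annotated disassembly below through JMH's Java API (the command-line runner with -prof perfasm and -jvmArgs works just as well). It assumes Linux perf and the hsdis disassembler are installed; the runner class name is made up:

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunBranchFrequency {
    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(BranchFrequency.class.getSimpleName())
                .addProfiler("perfasm")                        // print annotated hot assembly
             // .jvmArgsAppend("-XX:ConditionalMoveLimit=0")   // uncomment to inhibit CMOV conversion
                .build();
        new Runner(opt).run();
    }
}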

# doCall, out of box variant
  4.36%  ...4ac: mov    $0x1,%r11d         ; move $1 -> %r11
  3.24%  ...4b2: mov    $0x2,%eax          ; move $2 -> %res
  8.46%  ...4b7: test   %edx,%edx          ; test boolean
  0.02%  ...4b9: cmovne %r11d,%eax         ; if true, move %r11 -> %res
  7.88%  ...4bd: add    $0x10,%rsp         ; exit the method
  8.12%  ...4c1: pop    %rbp
 18.60%  ...4c2: cmp    0x340(%r15),%rsp
         ...4c9: ja     ...4d0
  0.14%  ...4cf: retq

# doCall, CMOV conversion inhibited
  6.48%    ...cac: test   %edx,%edx         ; test boolean
        ╭  ...cae: je     ...cc8
        │                                   ; if true...
        │  ...cb0: mov    $0x1,%eax         ; move $1 -> %res
  7.41% │↗ ...cb5: add    $0x10,%rsp        ; exit the method
  0.02% ││ ...cb9: pop    %rbp
 27.43% ││ ...cba: cmp    0x340(%r15),%rsp
        ││ ...cc1: ja     ...ccf
  3.28% ││ ...cc7: retq
        ││                                  ; if false...
  7.04% ↘│ ...cc8: mov    $0x2,%eax         ; move $2 -> %res
  0.02%  ╰ ...ccd: jmp    ...cb5            ; jump back

The CMOV version performs a little better in this example:

Benchmark                              Mode  Cnt   Score    Error  Units

# Branches
BranchFrequency.fair                   avgt   25   5.422 ±  0.026  ns/op
BranchFrequency.fair:L1-dcache-loads   avgt    5  12.078 ±  0.226   #/op
BranchFrequency.fair:L1-dcache-stores  avgt    5   5.037 ±  0.120   #/op
BranchFrequency.fair:branch-misses     avgt    5   0.001 ±  0.003   #/op
BranchFrequency.fair:branches          avgt    5  10.037 ±  0.216   #/op
BranchFrequency.fair:cycles            avgt    5  14.659 ±  0.285   #/op
BranchFrequency.fair:instructions      avgt    5  35.184 ±  0.559   #/op

# CMOVs
BranchFrequency.fair                   avgt   25   4.799 ±  0.094  ns/op
BranchFrequency.fair:L1-dcache-loads   avgt    5  12.014 ±  0.329   #/op
BranchFrequency.fair:L1-dcache-stores  avgt    5   5.005 ±  0.167   #/op
BranchFrequency.fair:branch-misses     avgt    5  ≈ 10⁻⁴            #/op
BranchFrequency.fair:branches          avgt    5   7.054 ±  0.118   #/op
BranchFrequency.fair:cycles            avgt    5  12.964 ±  1.451   #/op
BranchFrequency.fair:instructions      avgt    5  36.285 ±  0.713   #/op

You might think that this is because the branch misprediction penalty is absent for CMOV, but that explanation is at odds with the counters. Note that "branch-misses" is nearly zero in both cases. This is because hardware branch predictors can actually remember a short branching history, and this flip-flopping branch poses no problem to them. The actual cause of the performance difference is the jumping around in the branchy case: we have additional control-flow instructions on the critical path. Indeed, the branchy version retires about three more branches per operation (two je-s plus one jmp for the two doCall invocations), which matches the difference in the "branches" counters above.

Practice 2

To see the effects of branch misprediction penalties, we need to do a more advanced test, like this:

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class AdjustableBranchFreq {

    @Param("50")
    int percent;

    boolean[] arr;

    @Setup(Level.Iteration)
    public void setup() {
        final int SIZE = 100_000;
        final int Q = 1_000_000;
        final int THRESH = percent * Q / 100;
        arr = new boolean[SIZE];
        ThreadLocalRandom current = ThreadLocalRandom.current();
        for (int c = 0; c < SIZE; c++) {
            arr[c] = current.nextInt(Q) < THRESH;
        }

        // Avoid uncommon traps on both branches.
        doCall(true);
        doCall(false);
    }

    @Benchmark
    public void test() {
        for (boolean cond : arr) {
            doCall(cond);
        }
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public int doCall(boolean condition) {
        if (condition) {
            return 1;
        } else {
            return 2;
        }
    }
}
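
One way to run the sweep is to pass different percent values with -p and attach -prof perfnorm on the JMH command line; an equivalent sketch with the Java API (the particular percent values here are just an illustration) looks like this:

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunAdjustableBranchFreq {
    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(AdjustableBranchFreq.class.getSimpleName())
                .param("percent", "0", "10", "25", "50", "75", "90", "100") // sweep branch probability
                .addProfiler("perfnorm")                                    // normalized HW counters per op
                .build();
        new Runner(opt).run();
    }
}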

Running it across the range of percent values with the -prof perfnorm JMH profiler would yield this:

[Plot: perfnorm results over the range of percent values, for the "branches" and "cmovs" variants]

You can clearly see a few things:

  1. The number of branches per test is about 5, and CMOV conversion drops it to 4. This correlates with the disassembly dumps above: one of the branches in the test is converted to a CMOV. The other 4 branches come from the test infrastructure itself.

  2. Without CMOVs, the branchy test performance suffers and is at its worst around 50% branch probability. This peak reflects the near-complete confusion of the hardware branch predictor, as it experiences about 0.5 branch misses per operation. This means the branch predictor does not guess wrong all the time (that would be ludicrous!), but just half of the time. I speculate that the history-based predictor just gives up and lets the static predictor pick one of the targets, which turns out to be the right one only half of the time.

  3. With CMOVs, we can see nearly-flat operation times once the conversion kicks in. The graph suggests that the CMOV cost model was probably too conservative for this test, and it switched a bit too late. That does not necessarily mean it has a bug, because other cases would quite probably perform differently. Still, the improvements over the branchy case are massive when CMOV conversion takes place.

  4. You might notice that the branchy variant dips below the flat CMOV average when branches are predicted with >97% accuracy. Of course, this is again a test-, hardware-, and VM-specific thing.

Conclusion

Branch profiling allows making more or less informed choices about probability-sensitive instruction selection. Conditional move replacement routinely uses branch frequency information to drive the substitution. Once again, this underlines the need to warm up JIT-compiled code with data that resembles real data, so that compilers can optimize efficiently for the particular case.