Aleksey Shipilёv, @shipilev, aleksey@shipilev.net

Note
This post is also available in ePUB and mobi.

Preface

With a plethora of new languages coming and going, people frequently try to do the cross-language performance comparisons. In most cases, that’s a strange thing to do, because even the minute differences in implementations can drive the benchmark scores off the cliff. It’s difficult even to compare a single small change in the otherwise stable environment. Trying to do the same when language compilers, bytecode generators, optimization opportunities, API differences, and contract obligations may differ drastically, happens to be orders of magnitude more complicated.

In this post, we will explore one of Java vs Scala comparisons. This is the extended version of my answer in the relevant thread on StackOverlow, and here we would see how to approach these problems, and what conclusions one may draw from the experiments.

As a good tradition, we will take some diversions into benchmarking methodology, so that even though the post itself is targeted to platform people, the non-platform people can still learn a few tricks. As usual, if you still haven’t learned about JMH and/or haven’t looked through the JMH samples, then I suggest you do that first before reading the rest of this post for the best experience.

Thanks to Nitsan Wakart and Pavel Rappo for review!

Benchmark

The discussion in that particular StackOverflow thread dates back a few questions, so instead of digging there, we will just take the latest benchmark code, and wrap it up with JMH. JMH already has the bindings for Java and Scala, which somewhat alleviates the difference in testing methodology. You can find full benchmark code here (warning, it contains spoilers).

The original benchmark may be expressed with the help of JMH as follows:

@State(Scope.Benchmark)
class ScalaBench {

  @Param(Array("1", "5", "10", "15", "20"))
  var lim: Int = _

  @tailrec private def isEvenlyDivisible(v: Int, div: Int, lim: Int): Boolean = {
    if (div > lim) true
    else (v % div == 0) && isEvenlyDivisible(v, div + 1, lim)
  }

  @Benchmark
  def test(): Int = {
    var v = 10
    while(!isEvenlyDivisible(v, 2, lim))
      v += 2
    v
  }
}

What does this thing do? This code tries to find the first even integer past 10, which is evenly divisible by all integers in [2..lim). isEvenlyDivisible is the tail-recursive implementation of the search through div.

The straight rewrite of the same benchmark in Java yields:

@State(Scope.Benchmark)
public class JavaBench {

    @Param({"1", "5", "10", "15", "20"})
    int lim;

    private boolean isEvenlyDivisible(int val, int div, int lim) {
        if (div > lim)
            return true;
        else
            return (val % div == 0) && isEvenlyDivisible(val, div + 1, lim);
    }

    @Benchmark
    public int test() {
        int val = 10;
        while(!isEvenlyDivisible(val, 2, lim))
            val += 2;
        return val;
    }

}

Okay, now we have two seemingly similar benchmarks, let’s run them! Shall we? But before we do that, let us try our intuition and make some predictions. Most people would predict that since JVMs do not have tail recursion elimination yet and also are rather pessimistic with recursive inlining, the @tailrec variant exploded by scalac itself would bring in the performance of ordinary looping. Therefore, one would conclude Scala benchmark should win big time over Java benchmark.

Good, let’s see. Running this on my laptop (2x2 i5-2520M, 2.0 GHz, Linux x86_64, JDK 8 GA) yields:

Benchmark             (lim)   Mode   Samples        Score  Score error    Units
n.s.ScalaBench.test       1   avgt        15        0.002        0.000    us/op
n.s.ScalaBench.test       5   avgt        15        0.494        0.005    us/op
n.s.ScalaBench.test      10   avgt        15       24.228        0.268    us/op
n.s.ScalaBench.test      15   avgt        15     3457.733       33.070    us/op
n.s.ScalaBench.test      20   avgt        15  2505634.259    15366.665    us/op

Benchmark             (lim)   Mode   Samples        Score  Score error    Units
n.s.JavaBench.test        1   avgt        15        0.002        0.000    us/op
n.s.JavaBench.test        5   avgt        15        0.252        0.001    us/op
n.s.JavaBench.test       10   avgt        15       12.782        0.325    us/op
n.s.JavaBench.test       15   avgt        15     1615.890        7.647    us/op
n.s.JavaBench.test       20   avgt        15  1053187.731    20502.217    us/op

That’s odd: Scala benchmark is actually twice as slow. Consistently. At this point, people try to muck the benchmark into showing what they want it to show. The obvious thing to consider is inhibiting tail recursion elimination altogether. It is a bit confusing to also drop private, but this is needed since scalac apparrently does tailrec elimination even for unannotated private methods.

--- ScalaBench.scala
+++ ScalaBenchNoTailrec.scala
@@ -10,12 +12,12 @@
 @State(Scope.Benchmark)
-class ScalaBench {
+class ScalaBenchNoTailrec {

   @Param(Array("1", "5", "10", "15", "20"))
   var lim: Int = _

-  @tailrec private def isEvenlyDivisible(v: Int, div: Int, lim: Int): Boolean = {
+  def isEvenlyDivisible(v: Int, div: Int, lim: Int): Boolean = {
     if (div > lim) true
     else (v % div == 0) && isEvenlyDivisible(v, div + 1, lim)
   }

This gets the performance back to Java level:

Benchmark                      (lim)  Mode  Samples        Score  Score error  Units
n.s.ScalaBenchNoTailrec.test       1  avgt       15        0.002        0.000  us/op
n.s.ScalaBenchNoTailrec.test       5  avgt       15        0.253        0.002  us/op
n.s.ScalaBenchNoTailrec.test      10  avgt       15       12.482        0.056  us/op
n.s.ScalaBenchNoTailrec.test      15  avgt       15     1615.506       11.100  us/op
n.s.ScalaBenchNoTailrec.test      20  avgt       15  1055345.330    10848.845  us/op

Everyone is happy, except for everyone else who is sad. "Why would such a nice scalac optimization turn the code into slowish nightmare?", they would ask themselves.

Analyzing Bottlenecks

There are two basic approaches to analyze a benchmark:

  1. You juggle the free variables and see how benchmark responds to them, thus doing positive and negative control to figure out whether your experimental setup is sane. The good examples of this approach are my previous posts "The Exceptional Performance of Lil' Exception" and "Nanotrusting the Nanotime". This method dubs the scientific approach from natural sciences where there is no better way to confirm the experimental setup.

  2. You observe and poke around the benchmark while it is running, thus analyzing bottlenecks. Brendan Gregg calls it "Active Benchmarking". This is something that differs software engineering from natural science — here, we can disassemble/profile/observe all the layers down our artificial Universe to see what is happening. Physicists would go for a kill to have an opportunity like that in their field of study.

Here, we would explore the bottlenecks part of methodology. What could be easier than to profile a benchmark, right? There are plenty of profilers in the world, just pick one, and shoot. That’s not what you should start with though: before trying to profile the application itself, make sure that anything else is not a problem (high sys%, iowait%, etc: see our Performance Methodology talks). We will briefly mention that both Java and Scala benchmarks indeed runs in usr% nicely.

Java Stack Profilers

Let us start with some blunt tools. JMH ships with naive sampling profiler, added there for two reasons: a) fun; b) letting people quickly diagnose horrible mistakes in their benchmarks (also fun). Let’s activate it with -prof stack, select a single lim with -p lim=10, and see:

Java

Result: 12.719 ±(99.9%) 0.284 us/op [Average]
  Statistics: (min, avg, max) = (12.642, 12.719, 12.795), stdev = 0.074
  Confidence interval (99.9%): [12.435, 13.003]

....[Thread state distributions].....................................................
 91.3%         RUNNABLE
  8.7%         WAITING

....[Thread state: RUNNABLE].........................................................
 58.0%  63.5% net.shipilev.JavaBench.isEvenlyDivisible
 32.9%  36.1% net.shipilev.JavaBench.test
  0.2%   0.2% java.lang.Thread.currentThread
  0.2%   0.2% sun.reflect.Reflection.getClassAccessFlags

....[Thread state: WAITING]..........................................................
  8.7% 100.0% sun.misc.Unsafe.park

Scala

Result: 24.076 ±(99.9%) 0.728 us/op [Average]
  Statistics: (min, avg, max) = (23.930, 24.076, 24.403), stdev = 0.189
  Confidence interval (99.9%): [23.348, 24.804]

....[Thread state distributions].....................................................
 91.4%         RUNNABLE
  8.6%         WAITING

....[Thread state: RUNNABLE].........................................................
 90.6%  99.1% net.shipilev.ScalaBench.test
  0.9%   0.9% net.shipilev.generated.ScalaBench_test.test_avgt_jmhLoop

....[Thread state: WAITING]..........................................................
  8.6% 100.0% sun.misc.Unsafe.park

Okay, nothing very interesting here. It seems that in Scala version, isEvenlyDivisible is folded and inlined into test. Actually, this profile looks a bit wrong, because we would expect test to be perfectly inlined into JMH’s …​jmhLoop method which contains the super-optimized generated code for the benchmark. We even force compiler to inline it there. But not to worry, this is what thread stacks are lying to us about: they show us where the thread is residing in some Java wonderland, not where it really is in the compiled code.

Hardware Counters Profilers

We need to go a level deeper. We have to tap into the very fabric of the benchmark. For things like this we have the ultimate "perfasm" profiler in JMH 0.9+, which combines the powers of Linux perf and internal JVM disassemblers. Let us unlock it with -prof perfasm and look into the output. It normally dumps quite a bit of info, but we will digest it in parts. Since we have the printed assembly on our hands, thanks to -XX:+PrintAssembly, we can map those events back to Java code, and figure out what Java or native method those events belong to. The tables below contain the aggregated statistics from perf profiling for "cycles" and "instruction" events in first two columns, respectively.

Java

JavaBench had failed to inline isEvenlyDivisible, because it is the recursive method:

....[Hottest Methods (after inlining)]...............................................
 52.67%   50.95%  net.shipilev.JavaBench::isEvenlyDivisible
 44.11%   46.18%  net.shipilev.generated.JavaBench_test::test_avgt_jmhLoop
  1.52%    1.35%  <kernel>
  0.89%    0.94%  sun.reflect.ClassFileAssembler::cpi
  0.10%    0.11%  vfprintf (libc-2.15.so)
 (...skipped...)
.....................................................................................
100.00%   99.89%  <totals>

Scala

Here, we can see that ScalaBench is perfectly compiled into a single generated method, which means @tailrec really worked!

....[Hottest Methods (after inlining)]...............................................
 95.76%   96.42%  net.shipilev.generated.ScalaBench_test::test_avgt_jmhLoop
  2.30%    1.54%  <kernel>
  1.54%    1.75%  java.io.DataOutputStream::writeInt
  0.14%    0.10%  print_insn (libhsdis-amd64.so)
  0.10%    0.07%  [unknown] (perf-3369.map)
  0.06%    0.01%  oappend (libhsdis-amd64.so)
 (...skipped...)
.....................................................................................
100.00%   99.96%  <totals>

Still, Scala benchmark is slower, why? Now that we established the areas where our benchmark spends time, we can dive into assembly.

Hardware Counters + PrintAssembly

It really helps to have megabytes of assembly to be annotated with perf event counters, because it contrasts the part we should actually digest. Good hardware-counter-enabled profilers do that (Solaris Studio Performance Analyzer, for example, does that for Java code nicely). JMH’s perfasm does that, see below.

Scala

Here, this is Scala’s hottest region, dominating with 95.76% CPU cycles:

....[Hottest Region 1].......................................................................
 [0x7fbb20f8574f:0x7fbb20f85782] in net.shipilev.generated.ScalaBench_test::test_avgt_jmhLoop

                    0x00007fbb20f85747: pop    %rbp
                    0x00007fbb20f85748: test   %eax,0x15d3a8b2(%rip)
                    0x00007fbb20f8574e: retq

                  ; <loop starts>
  2.57%    0.20%    0x00007fbb20f8574f: add    $0x2,%r11d
  0.01%    0.01%    0x00007fbb20f85753: test   %eax,0x15d3a8a7(%rip)
  0.43%    0.70%    0x00007fbb20f85759: mov    $0x2,%ecx
  0.01%    0.01%    0x00007fbb20f8575e: xchg   %ax,%ax
  2.29%    3.12%    0x00007fbb20f85760: test   %ecx,%ecx
                    0x00007fbb20f85762: je     0x00007fbb20f8581f
  0.14%             0x00007fbb20f85768: mov    %r11d,%eax
  4.44%    6.91%    0x00007fbb20f8576b: cmp    $0x80000000,%eax
                    0x00007fbb20f85770: jne    0x00007fbb20f85779
                    0x00007fbb20f85772: xor    %edx,%edx
                    0x00007fbb20f85774: cmp    $0xffffffffffffffff,%ecx
                    0x00007fbb20f85777: je     0x00007fbb20f8577c

                  ; <INTEGER DIVISION>
  0.27%    0.17%    0x00007fbb20f85779: cltd
  2.24%   17.26%    0x00007fbb20f8577a: idiv   %ecx
 77.99%   66.44%    0x00007fbb20f8577c: test   %edx,%edx

                  ; <back branch for the loop>
                    0x00007fbb20f8577e: jne    0x00007fbb20f8574f
  4.56%    1.41%    0x00007fbb20f85780: inc    %ecx
  0.72%    0.13%    0x00007fbb20f85782: cmp    %r10d,%ecx
                    0x00007fbb20f85785: jl     0x00007fbb20f85760
                    0x00007fbb20f85787: mov    0xb0(%r9),%r10d
                    0x00007fbb20f8578e: mov    0xb4(%r9),%edx

We see the very tight loop here, and the integer division is the largest consumer of CPU cycles. How can you possibly beat this twice in throughput, like Java benchmark does? How on Earth can you run idiv any faster?

Java

Let us dive into Java benchmark assembly to answer that. The assembly for hot parts is larger, because both tail-recursion and recursive inlining of isEvenlyDivisible had not happened. Let us see isEvenlyDivisible part first. I skipped the irrelevant parts from the assembly, and also pruned some of the non-essential comments:

....[Hottest Region 1].......................................................................
 [0x7fa7c5196840:0x7fa7c51968b5] in net.shipilev.JavaBench::isEvenlyDivisible

 (...skipped...)

                  ; <INTEGER DIVISION>
  1.68%    2.76%    0x00007fa7c5196868: cltd
  0.06%    0.16%    0x00007fa7c5196869: idiv   %ecx
 27.59%   36.37%    0x00007fa7c519686b: test   %edx,%edx
                    0x00007fa7c519686d: jne    0x00007fa7c51968a8
 (...skipped...)

                  ; <INTEGER DIVISION>
  0.04%             0x00007fa7c5196891: cltd
                    0x00007fa7c5196892: idiv   %r10d
 12.24%    1.54%    0x00007fa7c5196895: test   %edx,%edx
                    0x00007fa7c5196897: jne    0x00007fa7c51968a8
 (...skipped...)

                  ; <RECURSIVE CALL TO isEvenlyDivisible()>
  0.01%             0x00007fa7c519689f: callq  0x00007fa7c5045d60
  0.04%             0x00007fa7c51968a4: test   %eax,%eax
 (...skipped...)

Here, we see two idiv-s which share the CPU cycles, and accounting for ~40% of cycles, and then the recursive call back to us. Nothing magical is there, except for two idiv-s, but that is because we had inlined the first call to isEvenlyDivisible, and then stopped:

 net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)
   @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
     @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   recursive inlining is too deep

Okay, moving on to the main benchmark method, which had inlined test internally:

 [0x7fa7c51a6264:0x7fa7c51a630a] in net.shipilev.generated.JavaBench_test::test_avgt_jmhLoop

 (...skipped...)
  1.32%    0.11%    0x00007fa7c51a626e: movslq %r9d,%rdx
  1.15%    1.77%    0x00007fa7c51a6271: mov    %r9d,%r10d
  0.88%    1.31%    0x00007fa7c51a6274: sar    $0x1f,%r10d

                  ; <INTEGER "DIVISION">
  1.34%    0.21%    0x00007fa7c51a6278: imul   $0x55555556,%rdx,%rdx
  1.25%    0.20%    0x00007fa7c51a627f: sar    $0x20,%rdx
  1.15%    2.36%    0x00007fa7c51a6283: mov    %edx,%esi
  0.95%    1.51%    0x00007fa7c51a6285: sub    %r10d,%esi            ; irem
  1.79%    0.50%    0x00007fa7c51a6288: mov    %esi,%edx
  2.12%    1.87%    0x00007fa7c51a628a: shl    %edx
  1.72%    3.15%    0x00007fa7c51a628c: add    %esi,%edx
  1.55%    2.59%    0x00007fa7c51a628e: sub    %edx,%edi
  2.90%    2.61%    0x00007fa7c51a6290: add    $0x2,%edi
 (...skipped...)

                  ; <CALL TO isEvenlyDivisible()>
  1.58%    2.99%    0x00007fa7c51a62b7: callq  0x00007fa7c5045d60
  0.18%    0.34%    0x00007fa7c51a62bc: test   %eax,%eax
 (...skipped...)

Wait. I see the reference to irem there, which is a bytecode instruction for integer remainder, but where is idiv? This is actually a nice trick from Hacker’s Delight, Ch. 10, Sect. 19. The trick is to compute remainder for a given divisor by multiplying to the reciprocal of that divisor, and shifting right to recover the remainder bits. HD has a nice section to explain how that works.

Hypothesis, Theory, Methodology

Embracing A Hunch

Okay, we now figured that the difference between Java benchmark and Scala benchmark is in the way we compute the remainder op. That optimization is only possible when you know the divisor, how’s that Java benchmark had figured it out, and Scala bench did not? It gets rather obvious when you look into inlining trees (available with -XX:+PrintInlining, and in JMH saved perfasm output):

net.shipilev.generated.JavaBench_test::test_avgt_jmhLoop (55 bytes)
 @ 7   java.lang.System::nanoTime (0 bytes)   (intrinsic)
 @ 16   net.shipilev.JavaBench::test (24 bytes)   force inline by CompilerOracle
   @ 10   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
     @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
   @ 10   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
     @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
       @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   recursive inlining is too deep
 @ 19   org.openjdk.jmh.infra.Blackhole::consume (39 bytes)   force inline by CompilerOracle
 @ 16   net.shipilev.JavaBench::test (24 bytes)   force inline by CompilerOracle
   @ 10   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
     @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
   @ 10   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
     @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
       @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   recursive inlining is too deep
 @ 19   org.openjdk.jmh.infra.Blackhole::consume (39 bytes)   force inline by CompilerOracle
 @ 36   java.lang.System::nanoTime (0 bytes)   (intrinsic)

We are able to inline a few first slabs of isEvenlyDivisible, and then we compile those slabs knowing the divisor on first iterations! The generic division is there in inEvenlyDivisible method body, where divisor is read from the method argument, and therefore unpredictable.

Scala version looks neat and optimized:

net.shipilev.generated.ScalaBench_test::test_avgt_jmhLoop (55 bytes)
 @ 7   java.lang.System::nanoTime (0 bytes)   (intrinsic)
  @ 16   net.shipilev.ScalaBench::test (25 bytes)   force inline by CompilerOracle
   @ 7   net.shipilev.ScalaBench::lim (5 bytes)   accessor
   @ 10   net.shipilev.ScalaBench::isEvenlyDivisible (29 bytes)   inline (hot)
   @ 7   net.shipilev.ScalaBench::lim (5 bytes)   accessor
   @ 10   net.shipilev.ScalaBench::isEvenlyDivisible (29 bytes)   inline (hot)
 @ 19   org.openjdk.jmh.infra.Blackhole::consume (39 bytes)   force inline by CompilerOracle
 @ 16   net.shipilev.ScalaBench::test (25 bytes)   force inline by CompilerOracle
   @ 7   net.shipilev.ScalaBench::lim (5 bytes)   accessor
   @ 10   net.shipilev.ScalaBench::isEvenlyDivisible (29 bytes)   inline (hot)
   @ 7   net.shipilev.ScalaBench::lim (5 bytes)   accessor
   @ 10   net.shipilev.ScalaBench::isEvenlyDivisible (29 bytes)   inline (hot)
 @ 19   org.openjdk.jmh.infra.Blackhole::consume (39 bytes)   force inline by CompilerOracle
 @ 36   java.lang.System::nanoTime (0 bytes)   (intrinsic)

…​but that’s the optimization turned sour. We indeed folded the recursion in the loop, but we were not able to optimize for starting divisors, see the assembly for ScalaBench loop above.

Modeling The Effect

So, we hypothesed peeling for first divisors should work. Let’s try to observe this effect in more controlled environment. Oh, I like to play an optimizing compiler on my own. That is, take this method as the example:

@Benchmark
public int test_02() {
  forval: for (int val = 10; ; val += 2) {
    for (int div = 2; div < lim; div++) {
       if (val % div != 0) continue forval;
    }
    return val;
  }
}

…​and try to peel the first iterations from the loop, like this:

@Benchmark
public int test_05() {
  forval: for (int val = 10; ; val += 2) {
    if (val % 2 != 0) continue;
    if (val % 3 != 0) continue;
    if (val % 4 != 0) continue;
    for (int div = 5; div < lim; div++) {
      if (val % div != 0) continue forval;
    }
    return val;
  }
}

We can produce multiple methods like this test_N, where N is the starting div. It is not, strictly speaking, the equivalent transformation when N > lim because we do the excess futile divisions, but we can workaround that by measuring test_02..test_10 only for lim=[10..20]. There:

Benchmark                     (lim)   Mode   Samples        Score  Score error  Units
n.s.JavaBenchPeeling.test_02     10   avgt        15       23.793        0.094  us/op
n.s.JavaBenchPeeling.test_03     10   avgt        15       15.439        0.048  us/op
n.s.JavaBenchPeeling.test_04     10   avgt        15        9.372        0.047  us/op
n.s.JavaBenchPeeling.test_05     10   avgt        15        6.800        0.034  us/op
n.s.JavaBenchPeeling.test_06     10   avgt        15        5.589        0.024  us/op
n.s.JavaBenchPeeling.test_07     10   avgt        15        6.086        0.058  us/op
n.s.JavaBenchPeeling.test_08     10   avgt        15        5.879        0.029  us/op
n.s.JavaBenchPeeling.test_09     10   avgt        15        5.890        0.059  us/op
n.s.JavaBenchPeeling.test_10     10   avgt        15        5.369        0.075  us/op

n.s.JavaBenchPeeling.test_02     15   avgt        15     3393.344       15.704  us/op
n.s.JavaBenchPeeling.test_03     15   avgt        15     2214.234        2.848  us/op
n.s.JavaBenchPeeling.test_04     15   avgt        15     1371.927       53.220  us/op
n.s.JavaBenchPeeling.test_05     15   avgt        15      956.982        1.420  us/op
n.s.JavaBenchPeeling.test_06     15   avgt        15      891.443       76.253  us/op
n.s.JavaBenchPeeling.test_07     15   avgt        15      784.682        4.270  us/op
n.s.JavaBenchPeeling.test_08     15   avgt        15      763.488        7.196  us/op
n.s.JavaBenchPeeling.test_09     15   avgt        15      869.820        3.332  us/op
n.s.JavaBenchPeeling.test_10     15   avgt        15      767.905        4.435  us/op

n.s.JavaBenchPeeling.test_02     20   avgt        15  2469200.228     1702.810  us/op
n.s.JavaBenchPeeling.test_03     20   avgt        15  1551785.134     4022.444  us/op
n.s.JavaBenchPeeling.test_04     20   avgt        15   883515.951     4080.213  us/op
n.s.JavaBenchPeeling.test_05     20   avgt        15   626238.203    10623.455  us/op
n.s.JavaBenchPeeling.test_06     20   avgt        15   565646.685     3103.268  us/op
n.s.JavaBenchPeeling.test_07     20   avgt        15   509773.855     4233.588  us/op
n.s.JavaBenchPeeling.test_08     20   avgt        15   492926.289     2542.706  us/op
n.s.JavaBenchPeeling.test_09     20   avgt        15   562482.076     2269.882  us/op
n.s.JavaBenchPeeling.test_10     20   avgt        15   494944.558     2444.475  us/op

We see the peeling really helps up to test_06, which contains four peeled iterations. We know the divisors for those iterations exactly, and so we generate the optimized code for them. The improvements past test_06 are diminishing quickly because more and more val-s get filtered early with short-cut checks, to the point each new check is not contributing to execution time any more — you can see that in annotated profiles, but this post is already too long.

The Methodology Quirk

Our theory explains why Scala benchmark without @tailrec is faster: we put everything in a single method, which helped us to avoid the excessive recursive call, but at the same time it robbed us of opportunity to peel the first predictable computations!

Now, this is arguably a benchmarking quirk. Most people think about dead-code elimination is the largest impediment for a good benchmarking. They usually forget that compilers can not only figure out the computation is redundant because of unclaimed outputs, but also fold some redundant computations because of the predictable inputs! JMH samples warn you about this. JVMLS talks warn you about this. And yet, everybody fails from time to time. Experienced guys fail less frequently though.

Okay, how do we fix the benchmarks JMH-style? Easy, read the values from heap, and let JMH to take care of the rest:

--- ScalaBench.scala
+++ ScalaBenchNonPredict.scala
@@ -10,11 +12,20 @@
 @State(Scope.Benchmark)
-class ScalaBench {
+class ScalaBenchNonPredict {

   @Param(Array("1", "5", "10", "15", "20"))
   var lim: Int = _

+  var startDiv: Int = _
+  var startVal: Int = _
+
+  @Setup
+  def init() {
+    startDiv = 2
+    startVal = 10
+  }
+
   @tailrec private def isEvenlyDivisible(v: Int, div: Int, lim: Int): Boolean = {
     if (div > lim) true
     else (v % div == 0) && isEvenlyDivisible(v, div + 1, lim)
@@ -22,8 +33,8 @@

   @Benchmark
   def test(): Int = {
-    var v = 10
-    while(!isEvenlyDivisible(v, 2, lim))
+    var v = startVal
+    while(!isEvenlyDivisible(v, startDiv, lim))
       v += 2
     v
   }
--- JavaBench.java
+++ JavaBenchNonPredict.java
@@ -9,11 +20,20 @@
 @State(Scope.Benchmark)
-public class JavaBench {
+public class JavaBenchNonPredict {

     @Param({"1", "5", "10", "15", "20"})
     int lim;

+    int startDiv;
+    int startVal;
+
+    @Setup
+    public void setup() {
+        startDiv = 2;
+        startVal = 10;
+    }
+
     private boolean isEvenlyDivisible(int val, int div, int lim) {
         if (div > lim)
             return true;
@@ -23,10 +43,9 @@

     @Benchmark
     public int test() {
-        int val = 10;
-        while(!isEvenlyDivisible(val, 2, lim))
+        int val = startVal;
+        while(!isEvenlyDivisible(val, startDiv, lim))
             val += 2;
         return val;
     }
 }

…​and these are the results:

Benchmark                     (lim)   Mode  Samples        Score  Score error  Units
n.s.ScalaBenchNonPredict.test     1   avgt       15        0.002        0.000  us/op
n.s.ScalaBenchNonPredict.test     5   avgt       15        0.489        0.002  us/op
n.s.ScalaBenchNonPredict.test    10   avgt       15       23.777        0.116  us/op
n.s.ScalaBenchNonPredict.test    15   avgt       15     3379.870        5.737  us/op
n.s.ScalaBenchNonPredict.test    20   avgt       15  2468845.944     2413.573  us/op

Benchmark                     (lim)   Mode  Samples        Score  Score error  Units
n.s.JavaBenchNonPredict.test      1   avgt       15        0.003        0.000  us/op
n.s.JavaBenchNonPredict.test      5   avgt       15        0.465        0.001  us/op
n.s.JavaBenchNonPredict.test     10   avgt       15       23.989        0.095  us/op
n.s.JavaBenchNonPredict.test     15   avgt       15     3453.116       16.390  us/op
n.s.JavaBenchNonPredict.test     20   avgt       15  2518726.451     4374.482  us/op

Love and harmony all around. Scala benchmark is a little better because the generated code with closed method is a bit better than the one calling the recursive one. But in this particular example, the difference is miniscule.

Conclusion

Comparing two implementations in otherwise the same environment is hard on its own. Comparing two implementations coming from two different ecosystems, when both implementations have experienced quite a few transformations before reaching the execution units on your CPU is very hard. The minute differences the code encounters while going through its own life-cycle can drastically affect its performance, ruining the comparison.

That is why, the cross-language benchmarks should include a tough analysis on what was going on. The superficial conclusions almost always feed on existing biases, and are almost always wrong. Blindly believing some optimizations work the way you expect them to work is unwarranted. The juxtapositions and interference of optimizations may lead to completely surprising outcomes.

Benchmarks are for understanding the reality, not for reinforcing the prejudices one has. In this example, did we learn what is faster, Java or Scala? Nope. But we learned a lot digging for explanations why the results are different, which will hopefully result in having better cohesion between languages and the underlying platform.

Oh, and if you think that part is constructive and boring, and you came here for a language holy-war, I do think both Java and Scala suck as programming languages, because they both allow me to write stupid programs with performance problems. There.

Epilogue

There are two comment threads at Reddit and Hacker News. Remarkably, I have to explain the last sentence in Conclusion section for Hacker News crowd.