Aleksey Shipilёv, @shipilev, aleksey@shipilev.net
Preface
With a plethora of new languages coming and going, people frequently try to do cross-language performance comparisons. In most cases, that’s a strange thing to do, because even minute differences in implementations can drive the benchmark scores off the cliff. It is difficult enough to compare a single small change in an otherwise stable environment. Trying to do the same when language compilers, bytecode generators, optimization opportunities, API differences, and contract obligations may differ drastically happens to be orders of magnitude more complicated.
In this post, we will explore one of the Java vs Scala comparisons. This is the extended version of my answer in the relevant thread on Stack Overflow, and here we will see how to approach these problems, and what conclusions one may draw from the experiments.
As is good tradition, we will take some diversions into benchmarking methodology, so that even though the post itself is targeted at platform people, non-platform people can still learn a few tricks. As usual, if you still haven’t learned about JMH and/or haven’t looked through the JMH samples, then I suggest you do that first before reading the rest of this post, for the best experience.
Thanks to Nitsan Wakart and Pavel Rappo for review!
Benchmark
The discussion in that particular StackOverflow thread dates back a few questions, so instead of digging there, we will just take the latest benchmark code and wrap it up with JMH. JMH already has bindings for both Java and Scala, which somewhat alleviates the differences in testing methodology. You can find the full benchmark code here (warning: it contains spoilers).
The original benchmark may be expressed with the help of JMH as follows:
@State(Scope.Benchmark)
class ScalaBench {

  @Param(Array("1", "5", "10", "15", "20"))
  var lim: Int = _

  @tailrec private def isEvenlyDivisible(v: Int, div: Int, lim: Int): Boolean = {
    if (div > lim) true
    else (v % div == 0) && isEvenlyDivisible(v, div + 1, lim)
  }

  @Benchmark
  def test(): Int = {
    var v = 10
    while (!isEvenlyDivisible(v, 2, lim))
      v += 2
    v
  }
}
What does this thing do? This code tries to find the first even integer past 10 which is evenly divisible by all integers in [2..lim]. isEvenlyDivisible is the tail-recursive implementation of the search through div.
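To make the contract concrete, here is a plain, standalone Java sketch of the same search (the class and method names here are my own, not part of the benchmark); for lim = 10 it finds 2520, the least common multiple of 2..10:

```java
// Hypothetical standalone reproduction of the benchmark's search;
// DivisibleSearch and findFirst are illustrative names of my own.
public class DivisibleSearch {
    // Same contract as the benchmark: checks all divisors in [2..lim].
    static boolean isEvenlyDivisible(int v, int div, int lim) {
        if (div > lim) return true;
        return (v % div == 0) && isEvenlyDivisible(v, div + 1, lim);
    }

    // Walks even candidates starting at 10 until one passes all checks.
    static int findFirst(int lim) {
        int v = 10;
        while (!isEvenlyDivisible(v, 2, lim)) v += 2;
        return v;
    }

    public static void main(String[] args) {
        System.out.println(findFirst(10)); // prints 2520 = lcm(2..10)
    }
}
```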
The straight rewrite of the same benchmark in Java yields:
@State(Scope.Benchmark)
public class JavaBench {

    @Param({"1", "5", "10", "15", "20"})
    int lim;

    private boolean isEvenlyDivisible(int val, int div, int lim) {
        if (div > lim)
            return true;
        else
            return (val % div == 0) && isEvenlyDivisible(val, div + 1, lim);
    }

    @Benchmark
    public int test() {
        int val = 10;
        while (!isEvenlyDivisible(val, 2, lim))
            val += 2;
        return val;
    }
}
Okay, now that we have two seemingly similar benchmarks, let’s run them! Shall we? But before we do that, let us try our intuition and make some predictions. Most people would predict that since JVMs do not have tail recursion elimination yet, and are also rather pessimistic with recursive inlining, the @tailrec variant, exploded into a loop by scalac itself, would bring in the performance of ordinary looping. Therefore, one would conclude the Scala benchmark should win big time over the Java benchmark.
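For reference, @tailrec makes scalac rewrite the recursion into iteration. A rough Java equivalent of the loop it produces (my own sketch, not decompiled scalac output) looks like this:

```java
// A sketch of what tail-call elimination does to isEvenlyDivisible:
// the recursive tail call becomes a jump back to the top, i.e. a plain loop.
public class TailrecSketch {
    static boolean isEvenlyDivisibleLoop(int v, int div, int lim) {
        while (div <= lim) {          // "if (div > lim) true" exit condition
            if (v % div != 0)
                return false;         // short-circuit of the && chain
            div++;                    // tail call with div + 1 becomes an increment
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isEvenlyDivisibleLoop(2520, 2, 10)); // true
        System.out.println(isEvenlyDivisibleLoop(2518, 2, 10)); // false
    }
}
```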
Good, let’s see. Running this on my laptop (2x2 i5-2520M, 2.0 GHz, Linux x86_64, JDK 8 GA) yields:
Benchmark            (lim)  Mode  Samples        Score  Score error  Units
n.s.ScalaBench.test      1  avgt       15        0.002        0.000  us/op
n.s.ScalaBench.test      5  avgt       15        0.494        0.005  us/op
n.s.ScalaBench.test     10  avgt       15       24.228        0.268  us/op
n.s.ScalaBench.test     15  avgt       15     3457.733       33.070  us/op
n.s.ScalaBench.test     20  avgt       15  2505634.259    15366.665  us/op

Benchmark            (lim)  Mode  Samples        Score  Score error  Units
n.s.JavaBench.test       1  avgt       15        0.002        0.000  us/op
n.s.JavaBench.test       5  avgt       15        0.252        0.001  us/op
n.s.JavaBench.test      10  avgt       15       12.782        0.325  us/op
n.s.JavaBench.test      15  avgt       15     1615.890        7.647  us/op
n.s.JavaBench.test      20  avgt       15  1053187.731    20502.217  us/op
That’s odd: the Scala benchmark is actually twice as slow. Consistently. At this point, people try to muck the benchmark into showing what they want it to show. The obvious thing to consider is inhibiting tail recursion elimination altogether. It is a bit confusing that we also have to drop private, but this is needed since scalac apparently does tailrec elimination even for unannotated private methods.
--- ScalaBench.scala
+++ ScalaBenchNoTailrec.scala
@@ -10,12 +12,12 @@
@State(Scope.Benchmark)
-class ScalaBench {
+class ScalaBenchNoTailrec {
@Param(Array("1", "5", "10", "15", "20"))
var lim: Int = _
- @tailrec private def isEvenlyDivisible(v: Int, div: Int, lim: Int): Boolean = {
+ def isEvenlyDivisible(v: Int, div: Int, lim: Int): Boolean = {
if (div > lim) true
else (v % div == 0) && isEvenlyDivisible(v, div + 1, lim)
}
This gets the performance back to Java level:
Benchmark                     (lim)  Mode  Samples        Score  Score error  Units
n.s.ScalaBenchNoTailrec.test      1  avgt       15        0.002        0.000  us/op
n.s.ScalaBenchNoTailrec.test      5  avgt       15        0.253        0.002  us/op
n.s.ScalaBenchNoTailrec.test     10  avgt       15       12.482        0.056  us/op
n.s.ScalaBenchNoTailrec.test     15  avgt       15     1615.506       11.100  us/op
n.s.ScalaBenchNoTailrec.test     20  avgt       15  1055345.330    10848.845  us/op
Everyone is happy, except for everyone else who is sad. "Why would such a nice scalac optimization turn the code into a slowish nightmare?", they would ask themselves.
Analyzing Bottlenecks
There are two basic approaches to analyze a benchmark:
-
You juggle the free variables and see how the benchmark responds to them, thus doing positive and negative controls to figure out whether your experimental setup is sane. Good examples of this approach are my previous posts "The Exceptional Performance of Lil' Exception" and "Nanotrusting the Nanotime". This mirrors the scientific method of the natural sciences, where there is no better way to validate an experimental setup.
-
You observe and poke around the benchmark while it is running, thus analyzing the bottlenecks. Brendan Gregg calls it "Active Benchmarking". This is something that distinguishes software engineering from natural science: here, we can disassemble/profile/observe all the layers down our artificial Universe to see what is happening. Physicists would kill for an opportunity like that in their field of study.
Here, we will explore the bottleneck-analysis part of the methodology. What could be easier than profiling a benchmark, right? There are plenty of profilers in the world; just pick one and shoot. That is not what you should start with, though: before trying to profile the application itself, make sure nothing else is the problem (high sys%, iowait%, etc.; see our Performance Methodology talks). We will just briefly mention that both the Java and Scala benchmarks indeed run nicely in usr%.
Java Stack Profilers
Let us start with some blunt tools. JMH ships with a naive sampling profiler, added there for two reasons: a) fun; b) letting people quickly diagnose horrible mistakes in their benchmarks (also fun). Let’s activate it with -prof stack, select a single lim with -p lim=10, and see:
Java
Result: 12.719 ±(99.9%) 0.284 us/op [Average]
  Statistics: (min, avg, max) = (12.642, 12.719, 12.795), stdev = 0.074
  Confidence interval (99.9%): [12.435, 13.003]

....[Thread state distributions].....................................................
 91.3%          RUNNABLE
  8.7%          WAITING

....[Thread state: RUNNABLE].........................................................
 58.0%  63.5%   net.shipilev.JavaBench.isEvenlyDivisible
 32.9%  36.1%   net.shipilev.JavaBench.test
  0.2%   0.2%   java.lang.Thread.currentThread
  0.2%   0.2%   sun.reflect.Reflection.getClassAccessFlags

....[Thread state: WAITING]..........................................................
  8.7% 100.0%   sun.misc.Unsafe.park
Scala
Result: 24.076 ±(99.9%) 0.728 us/op [Average]
  Statistics: (min, avg, max) = (23.930, 24.076, 24.403), stdev = 0.189
  Confidence interval (99.9%): [23.348, 24.804]

....[Thread state distributions].....................................................
 91.4%          RUNNABLE
  8.6%          WAITING

....[Thread state: RUNNABLE].........................................................
 90.6%  99.1%   net.shipilev.ScalaBench.test
  0.9%   0.9%   net.shipilev.generated.ScalaBench_test.test_avgt_jmhLoop

....[Thread state: WAITING]..........................................................
  8.6% 100.0%   sun.misc.Unsafe.park
Okay, nothing very interesting here. It seems that in the Scala version, isEvenlyDivisible is folded and inlined into test. Actually, this profile looks a bit wrong, because we would expect test to be perfectly inlined into JMH’s …jmhLoop method, which contains the super-optimized generated code for the benchmark. We even force the compiler to inline it there. But not to worry, this is just thread stacks lying to us: they show where the thread resides in some Java wonderland, not where it really is in the compiled code.
Hardware Counters Profilers
We need to go a level deeper. We have to tap into the very fabric of the benchmark. For things like this we have the ultimate "perfasm" profiler in JMH 0.9+, which combines the powers of Linux perf and the internal JVM disassemblers. Let us unlock it with -prof perfasm and look into the output. It normally dumps quite a bit of info, but we will digest it in parts. Since we have the printed assembly on our hands, thanks to -XX:+PrintAssembly, we can map the perf events back to Java code and figure out which Java or native method those events belong to. The tables below contain the aggregated statistics from perf profiling, with the "cycles" and "instructions" events in the first two columns, respectively.
Java
JavaBench failed to inline isEvenlyDivisible, because it is a recursive method:
....[Hottest Methods (after inlining)]...............................................
 52.67%  50.95%  net.shipilev.JavaBench::isEvenlyDivisible
 44.11%  46.18%  net.shipilev.generated.JavaBench_test::test_avgt_jmhLoop
  1.52%   1.35%  <kernel>
  0.89%   0.94%  sun.reflect.ClassFileAssembler::cpi
  0.10%   0.11%  vfprintf (libc-2.15.so)
(...skipped...)
.....................................................................................
100.00%  99.89%  <totals>
Scala
Here, we can see that ScalaBench is perfectly compiled into a single generated method, which means @tailrec really worked!
....[Hottest Methods (after inlining)]...............................................
 95.76%  96.42%  net.shipilev.generated.ScalaBench_test::test_avgt_jmhLoop
  2.30%   1.54%  <kernel>
  1.54%   1.75%  java.io.DataOutputStream::writeInt
  0.14%   0.10%  print_insn (libhsdis-amd64.so)
  0.10%   0.07%  [unknown] (perf-3369.map)
  0.06%   0.01%  oappend (libhsdis-amd64.so)
(...skipped...)
.....................................................................................
100.00%  99.96%  <totals>
Still, the Scala benchmark is slower. Why? Now that we have established the areas where our benchmark spends its time, we can dive into the assembly.
Hardware Counters + PrintAssembly
It really helps to have megabytes of assembly annotated with perf event counters, because it highlights the part we should actually digest. Good hardware-counter-enabled profilers do that (Solaris Studio Performance Analyzer, for example, does this nicely for Java code). JMH’s perfasm does that too, see below.
Scala
Here is the Scala benchmark’s hottest region, dominating with 95.76% of CPU cycles:
....[Hottest Region 1].......................................................................
[0x7fbb20f8574f:0x7fbb20f85782] in net.shipilev.generated.ScalaBench_test::test_avgt_jmhLoop
0x00007fbb20f85747: pop %rbp
0x00007fbb20f85748: test %eax,0x15d3a8b2(%rip)
0x00007fbb20f8574e: retq
; <loop starts>
2.57% 0.20% 0x00007fbb20f8574f: add $0x2,%r11d
0.01% 0.01% 0x00007fbb20f85753: test %eax,0x15d3a8a7(%rip)
0.43% 0.70% 0x00007fbb20f85759: mov $0x2,%ecx
0.01% 0.01% 0x00007fbb20f8575e: xchg %ax,%ax
2.29% 3.12% 0x00007fbb20f85760: test %ecx,%ecx
0x00007fbb20f85762: je 0x00007fbb20f8581f
0.14% 0x00007fbb20f85768: mov %r11d,%eax
4.44% 6.91% 0x00007fbb20f8576b: cmp $0x80000000,%eax
0x00007fbb20f85770: jne 0x00007fbb20f85779
0x00007fbb20f85772: xor %edx,%edx
0x00007fbb20f85774: cmp $0xffffffffffffffff,%ecx
0x00007fbb20f85777: je 0x00007fbb20f8577c
; <INTEGER DIVISION>
0.27% 0.17% 0x00007fbb20f85779: cltd
2.24% 17.26% 0x00007fbb20f8577a: idiv %ecx
77.99% 66.44% 0x00007fbb20f8577c: test %edx,%edx
; <back branch for the loop>
0x00007fbb20f8577e: jne 0x00007fbb20f8574f
4.56% 1.41% 0x00007fbb20f85780: inc %ecx
0.72% 0.13% 0x00007fbb20f85782: cmp %r10d,%ecx
0x00007fbb20f85785: jl 0x00007fbb20f85760
0x00007fbb20f85787: mov 0xb0(%r9),%r10d
0x00007fbb20f8578e: mov 0xb4(%r9),%edx
We see a very tight loop here, and the integer division is the largest consumer of CPU cycles. How can you possibly beat this twice over in throughput, like the Java benchmark does? How on Earth can you run idiv any faster?
Java
Let us dive into the Java benchmark assembly to answer that. The assembly for the hot parts is larger, because neither tail-recursion elimination nor full recursive inlining of isEvenlyDivisible had happened. Let us look at the isEvenlyDivisible part first. I skipped the irrelevant parts of the assembly, and also pruned some of the non-essential comments:
....[Hottest Region 1].......................................................................
[0x7fa7c5196840:0x7fa7c51968b5] in net.shipilev.JavaBench::isEvenlyDivisible
(...skipped...)
; <INTEGER DIVISION>
1.68% 2.76% 0x00007fa7c5196868: cltd
0.06% 0.16% 0x00007fa7c5196869: idiv %ecx
27.59% 36.37% 0x00007fa7c519686b: test %edx,%edx
0x00007fa7c519686d: jne 0x00007fa7c51968a8
(...skipped...)
; <INTEGER DIVISION>
0.04% 0x00007fa7c5196891: cltd
0x00007fa7c5196892: idiv %r10d
12.24% 1.54% 0x00007fa7c5196895: test %edx,%edx
0x00007fa7c5196897: jne 0x00007fa7c51968a8
(...skipped...)
; <RECURSIVE CALL TO isEvenlyDivisible()>
0.01% 0x00007fa7c519689f: callq 0x00007fa7c5045d60
0.04% 0x00007fa7c51968a4: test %eax,%eax
(...skipped...)
Here, we see two idiv-s which share the CPU cycles, accounting for ~40% of them, and then the recursive call back to ourselves. Nothing magical there, except for the two idiv-s, which appear because we had inlined the first call to isEvenlyDivisible, and then stopped:
net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)
  @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
    @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   recursive inlining is too deep
Okay, moving on to the main benchmark method, which had inlined test internally:
[0x7fa7c51a6264:0x7fa7c51a630a] in net.shipilev.generated.JavaBench_test::test_avgt_jmhLoop
(...skipped...)
1.32% 0.11% 0x00007fa7c51a626e: movslq %r9d,%rdx
1.15% 1.77% 0x00007fa7c51a6271: mov %r9d,%r10d
0.88% 1.31% 0x00007fa7c51a6274: sar $0x1f,%r10d
; <INTEGER "DIVISION">
1.34% 0.21% 0x00007fa7c51a6278: imul $0x55555556,%rdx,%rdx
1.25% 0.20% 0x00007fa7c51a627f: sar $0x20,%rdx
1.15% 2.36% 0x00007fa7c51a6283: mov %edx,%esi
0.95% 1.51% 0x00007fa7c51a6285: sub %r10d,%esi ; irem
1.79% 0.50% 0x00007fa7c51a6288: mov %esi,%edx
2.12% 1.87% 0x00007fa7c51a628a: shl %edx
1.72% 3.15% 0x00007fa7c51a628c: add %esi,%edx
1.55% 2.59% 0x00007fa7c51a628e: sub %edx,%edi
2.90% 2.61% 0x00007fa7c51a6290: add $0x2,%edi
(...skipped...)
; <CALL TO isEvenlyDivisible()>
1.58% 2.99% 0x00007fa7c51a62b7: callq 0x00007fa7c5045d60
0.18% 0.34% 0x00007fa7c51a62bc: test %eax,%eax
(...skipped...)
Wait. I see the reference to irem there, which is the bytecode instruction for integer remainder, but where is idiv? This is actually a nice trick from Hacker’s Delight, Ch. 10, Sect. 19. The trick is to compute the remainder for a known divisor by multiplying by the reciprocal of that divisor, shifting right to recover the quotient, and deriving the remainder from it. HD has a nice section explaining how that works.
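To illustrate, here is a hedged sketch of that trick for the fixed divisor 3, using the magic constant 0x55555556 visible in the assembly above; this follows the Hacker’s Delight recipe, though HotSpot’s exact code shape may differ:

```java
// Remainder by the known constant 3, without idiv: multiply by the
// "magic" reciprocal, take the high 32 bits as the quotient, fix up
// for negative inputs, then derive the remainder.
public class MagicRem {
    static int rem3(int n) {
        int q = (int) ((0x55555556L * n) >> 32); // high word of magic * n
        q += n >>> 31;                           // +1 correction for negative n
        return n - q * 3;                        // n % 3
    }

    public static void main(String[] args) {
        for (int n : new int[]{10, 2520, -7, Integer.MAX_VALUE}) {
            System.out.println(rem3(n) == n % 3); // all true
        }
    }
}
```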
Hypothesis, Theory, Methodology
Embracing A Hunch
Okay, we have now figured out that the difference between the Java benchmark and the Scala benchmark is in the way we compute the remainder op. That optimization is only possible when you know the divisor; so how is it that the Java benchmark figured it out, and the Scala benchmark did not? It gets rather obvious when you look into the inlining trees (available with -XX:+PrintInlining, and in JMH’s saved perfasm output):
net.shipilev.generated.JavaBench_test::test_avgt_jmhLoop (55 bytes)
  @ 7    java.lang.System::nanoTime (0 bytes)   (intrinsic)
  @ 16   net.shipilev.JavaBench::test (24 bytes)   force inline by CompilerOracle
    @ 10   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
      @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
    @ 10   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
      @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
        @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   recursive inlining is too deep
  @ 19   org.openjdk.jmh.infra.Blackhole::consume (39 bytes)   force inline by CompilerOracle
  @ 16   net.shipilev.JavaBench::test (24 bytes)   force inline by CompilerOracle
    @ 10   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
      @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
    @ 10   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
      @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   inline (hot)
        @ 19   net.shipilev.JavaBench::isEvenlyDivisible (31 bytes)   recursive inlining is too deep
  @ 19   org.openjdk.jmh.infra.Blackhole::consume (39 bytes)   force inline by CompilerOracle
  @ 36   java.lang.System::nanoTime (0 bytes)   (intrinsic)
We are able to inline the first few slabs of isEvenlyDivisible, and we compile those slabs knowing the divisor for the first iterations! The generic division is only there in the isEvenlyDivisible method body, where the divisor is read from the method argument, and is therefore unpredictable.
Scala version looks neat and optimized:
net.shipilev.generated.ScalaBench_test::test_avgt_jmhLoop (55 bytes)
  @ 7    java.lang.System::nanoTime (0 bytes)   (intrinsic)
  @ 16   net.shipilev.ScalaBench::test (25 bytes)   force inline by CompilerOracle
    @ 7    net.shipilev.ScalaBench::lim (5 bytes)   accessor
    @ 10   net.shipilev.ScalaBench::isEvenlyDivisible (29 bytes)   inline (hot)
    @ 7    net.shipilev.ScalaBench::lim (5 bytes)   accessor
    @ 10   net.shipilev.ScalaBench::isEvenlyDivisible (29 bytes)   inline (hot)
  @ 19   org.openjdk.jmh.infra.Blackhole::consume (39 bytes)   force inline by CompilerOracle
  @ 16   net.shipilev.ScalaBench::test (25 bytes)   force inline by CompilerOracle
    @ 7    net.shipilev.ScalaBench::lim (5 bytes)   accessor
    @ 10   net.shipilev.ScalaBench::isEvenlyDivisible (29 bytes)   inline (hot)
    @ 7    net.shipilev.ScalaBench::lim (5 bytes)   accessor
    @ 10   net.shipilev.ScalaBench::isEvenlyDivisible (29 bytes)   inline (hot)
  @ 19   org.openjdk.jmh.infra.Blackhole::consume (39 bytes)   force inline by CompilerOracle
  @ 36   java.lang.System::nanoTime (0 bytes)   (intrinsic)
…but that’s the optimization turned sour. We indeed folded the recursion into the loop, but we were not able to optimize for the starting divisors; see the assembly for the ScalaBench loop above.
Modeling The Effect
So, we hypothesized that peeling the first divisors should work. Let’s try to observe this effect in a more controlled environment. Oh, I like to play an optimizing compiler on my own. That is, take this method as the example:
@Benchmark
public int test_02() {
    forval: for (int val = 10; ; val += 2) {
        for (int div = 2; div < lim; div++) {
            if (val % div != 0) continue forval;
        }
        return val;
    }
}
…and try to peel the first iterations from the loop, like this:
@Benchmark
public int test_05() {
    forval: for (int val = 10; ; val += 2) {
        if (val % 2 != 0) continue;
        if (val % 3 != 0) continue;
        if (val % 4 != 0) continue;
        for (int div = 5; div < lim; div++) {
            if (val % div != 0) continue forval;
        }
        return val;
    }
}
We can produce multiple methods like this, test_N, where N is the starting div. It is not, strictly speaking, an equivalent transformation when N > lim, because we do excess futile divisions, but we can work around that by measuring test_02..test_10 only for lim = [10..20]. There:
Benchmark                     (lim)  Mode  Samples        Score  Score error  Units
n.s.JavaBenchPeeling.test_02     10  avgt       15       23.793        0.094  us/op
n.s.JavaBenchPeeling.test_03     10  avgt       15       15.439        0.048  us/op
n.s.JavaBenchPeeling.test_04     10  avgt       15        9.372        0.047  us/op
n.s.JavaBenchPeeling.test_05     10  avgt       15        6.800        0.034  us/op
n.s.JavaBenchPeeling.test_06     10  avgt       15        5.589        0.024  us/op
n.s.JavaBenchPeeling.test_07     10  avgt       15        6.086        0.058  us/op
n.s.JavaBenchPeeling.test_08     10  avgt       15        5.879        0.029  us/op
n.s.JavaBenchPeeling.test_09     10  avgt       15        5.890        0.059  us/op
n.s.JavaBenchPeeling.test_10     10  avgt       15        5.369        0.075  us/op
n.s.JavaBenchPeeling.test_02     15  avgt       15     3393.344       15.704  us/op
n.s.JavaBenchPeeling.test_03     15  avgt       15     2214.234        2.848  us/op
n.s.JavaBenchPeeling.test_04     15  avgt       15     1371.927       53.220  us/op
n.s.JavaBenchPeeling.test_05     15  avgt       15      956.982        1.420  us/op
n.s.JavaBenchPeeling.test_06     15  avgt       15      891.443       76.253  us/op
n.s.JavaBenchPeeling.test_07     15  avgt       15      784.682        4.270  us/op
n.s.JavaBenchPeeling.test_08     15  avgt       15      763.488        7.196  us/op
n.s.JavaBenchPeeling.test_09     15  avgt       15      869.820        3.332  us/op
n.s.JavaBenchPeeling.test_10     15  avgt       15      767.905        4.435  us/op
n.s.JavaBenchPeeling.test_02     20  avgt       15  2469200.228     1702.810  us/op
n.s.JavaBenchPeeling.test_03     20  avgt       15  1551785.134     4022.444  us/op
n.s.JavaBenchPeeling.test_04     20  avgt       15   883515.951     4080.213  us/op
n.s.JavaBenchPeeling.test_05     20  avgt       15   626238.203    10623.455  us/op
n.s.JavaBenchPeeling.test_06     20  avgt       15   565646.685     3103.268  us/op
n.s.JavaBenchPeeling.test_07     20  avgt       15   509773.855     4233.588  us/op
n.s.JavaBenchPeeling.test_08     20  avgt       15   492926.289     2542.706  us/op
n.s.JavaBenchPeeling.test_09     20  avgt       15   562482.076     2269.882  us/op
n.s.JavaBenchPeeling.test_10     20  avgt       15   494944.558     2444.475  us/op
We see that peeling really helps up to test_06, which contains four peeled iterations. We know the divisors for those iterations exactly, and so we generate the optimized code for them. The improvements past test_06 diminish quickly, because more and more val-s get filtered early by the short-cut checks, to the point where each new check no longer contributes to the execution time: you can see that in the annotated profiles, but this post is already too long.
The Methodology Quirk
Our theory explains why the Scala benchmark without @tailrec is faster: we put everything in a single method, which helped us avoid the excessive recursive calls, but at the same time it robbed us of the opportunity to peel the first, predictable computations!
Now, this is arguably a benchmarking quirk. Most people think dead-code elimination is the largest impediment to good benchmarking. They usually forget that compilers can not only figure out that a computation is redundant because of unclaimed outputs, but can also fold some computations because of predictable inputs! JMH samples warn you about this. JVMLS talks warn you about this. And yet, everybody fails from time to time. Experienced folks just fail less frequently.
Okay, how do we fix the benchmarks, JMH-style? Easy: read the values from the heap, and let JMH take care of the rest:
--- ScalaBench.scala
+++ ScalaBenchNonPredict.scala
@@ -10,11 +12,20 @@
@State(Scope.Benchmark)
-class ScalaBench {
+class ScalaBenchNonPredict {
@Param(Array("1", "5", "10", "15", "20"))
var lim: Int = _
+ var startDiv: Int = _
+ var startVal: Int = _
+
+ @Setup
+ def init() {
+ startDiv = 2
+ startVal = 10
+ }
+
@tailrec private def isEvenlyDivisible(v: Int, div: Int, lim: Int): Boolean = {
if (div > lim) true
else (v % div == 0) && isEvenlyDivisible(v, div + 1, lim)
@@ -22,8 +33,8 @@
@Benchmark
def test(): Int = {
- var v = 10
- while(!isEvenlyDivisible(v, 2, lim))
+ var v = startVal
+ while(!isEvenlyDivisible(v, startDiv, lim))
v += 2
v
}
--- JavaBench.java
+++ JavaBenchNonPredict.java
@@ -9,11 +20,20 @@
@State(Scope.Benchmark)
-public class JavaBench {
+public class JavaBenchNonPredict {
@Param({"1", "5", "10", "15", "20"})
int lim;
+ int startDiv;
+ int startVal;
+
+ @Setup
+ public void setup() {
+ startDiv = 2;
+ startVal = 10;
+ }
+
private boolean isEvenlyDivisible(int val, int div, int lim) {
if (div > lim)
return true;
@@ -23,10 +43,9 @@
@Benchmark
public int test() {
- int val = 10;
- while(!isEvenlyDivisible(val, 2, lim))
+ int val = startVal;
+ while(!isEvenlyDivisible(val, startDiv, lim))
val += 2;
return val;
}
}
…and these are the results:
Benchmark                      (lim)  Mode  Samples        Score  Score error  Units
n.s.ScalaBenchNonPredict.test      1  avgt       15        0.002        0.000  us/op
n.s.ScalaBenchNonPredict.test      5  avgt       15        0.489        0.002  us/op
n.s.ScalaBenchNonPredict.test     10  avgt       15       23.777        0.116  us/op
n.s.ScalaBenchNonPredict.test     15  avgt       15     3379.870        5.737  us/op
n.s.ScalaBenchNonPredict.test     20  avgt       15  2468845.944     2413.573  us/op

Benchmark                      (lim)  Mode  Samples        Score  Score error  Units
n.s.JavaBenchNonPredict.test       1  avgt       15        0.003        0.000  us/op
n.s.JavaBenchNonPredict.test       5  avgt       15        0.465        0.001  us/op
n.s.JavaBenchNonPredict.test      10  avgt       15       23.989        0.095  us/op
n.s.JavaBenchNonPredict.test      15  avgt       15     3453.116       16.390  us/op
n.s.JavaBenchNonPredict.test      20  avgt       15  2518726.451     4374.482  us/op
Love and harmony all around. The Scala benchmark is a little better, because the generated code with the recursion folded into a closed method is a bit better than the code calling the recursive method. But in this particular example, the difference is minuscule.
Conclusion
Comparing two implementations in an otherwise identical environment is hard on its own. Comparing two implementations coming from two different ecosystems, when both implementations have experienced quite a few transformations before reaching the execution units on your CPU, is very hard. The minute differences the code encounters while going through its own life-cycle can drastically affect its performance, ruining the comparison.
That is why cross-language benchmarks should include a thorough analysis of what is going on. Superficial conclusions almost always feed on existing biases, and are almost always wrong. Blindly believing that some optimizations work the way you expect them to is unwarranted. The juxtaposition and interference of optimizations may lead to completely surprising outcomes.
Benchmarks are for understanding reality, not for reinforcing one’s prejudices. In this example, did we learn which is faster, Java or Scala? Nope. But we learned a lot digging for explanations of why the results are different, which will hopefully result in better cohesion between the languages and the underlying platform.
Oh, and if you think that part is constructive and boring, and you came here for a language holy war: I do think both Java and Scala suck as programming languages, because they both allow me to write stupid programs with performance problems. There.
Epilogue
There are two comment threads, at Reddit and Hacker News. Remarkably, I have to explain the last sentence of the Conclusion section to the Hacker News crowd.