Aleksey Shipilёv, @shipilev, aleksey@shipilev.net
Preface
The post you’re reading is loosely based on two Russian blog posts (1) (2) I wrote a few years ago. Since that time, the JMH microbenchmarking framework has been released, making it possible to render the previous discussion more rigorous and quantitative. Those posts began when someone asked me about the performance of the try-catch construct, and the performance implications of Exceptions in general. In this English version, we unify and extend the previous Russian posts.
We would like to thank Marcus Lagergren for quick sanity checks on the post, and Jerry Kuch for lots of editorial comments on my non-native English. Thanks guys!
Along the way, we will take some diversions into benchmarking methodology, effectively making this post a primer on quantifying the performance of language features in a concrete implementation. If you still haven’t learned about JMH and/or haven’t looked through the JMH samples, then I suggest you do that first before reading the rest of this post for the best experience.
Brace yourselves and good luck.
Lil' Exception is Conceived
Let us start with the exceptions themselves. This is the sample Exception we will be dealing with throughout the post:
package net.shipilev.perf.exceptions;

public class LilException extends Exception {
    private final int metadata;

    public LilException(int metadata) {
        super();
        this.metadata = metadata;
    }

    public LilException(LilException e, int metadata) {
        super(e);
        this.metadata = metadata;
    }

    public int getMetadata() {
        return metadata;
    }
}
Note that the above exception contains the LilException.metadata field, which vaguely simulates the sort of auxiliary information that might be placed in an Exception upon its creation. LilException.metadata is important for our benchmarking purposes, because hollow exceptions behave a little differently, as you will see below.
Lil' Exception’s Prospects of Life
Rationale
To get started, we should first do some very basic benchmarks demonstrating throwing and immediately catching the exceptions in a simple try-catch body. You can find the complete code for this benchmark here, but we will highlight the important points below:
- First of all, let’s consider a few scenarios, namely: returning a value normally, throwing the dynamic exception, and throwing the static exception. Note that all of our benchmark targets are marked with the @CompilerControl annotation to force inlining. If you took my earlier advice and looked around the JMH project, you may recall that @CompilerControl is a VM-specific mechanism for influencing how the JVM’s JIT compiler will treat our benchmark methods.
@CompilerControl(CompilerControl.Mode.INLINE)
public int doSomething() {
    return source;
}

@CompilerControl(CompilerControl.Mode.INLINE)
public int doSomething_Exception() throws LilException {
    throw new LilException(source);
}

@CompilerControl(CompilerControl.Mode.INLINE)
public int doSomething_Exception_Static() throws LilException {
    throw staticException;
}
- Static exceptions are exception instances which are pre-cached in a field. In practice, such usages might be considered either a code smell or a performance optimization, depending on context, but for our current purposes they’re important as a control, providing an additional reference point against which to validate our thinking about exception performance:
@State(Scope.Thread)
public class BasicBench {
    LilException staticException;
    int source;

    @Setup(Level.Iteration)
    public void setup() {
        staticException = new LilException(source);
    }

    ...
- We choose to instantiate LilException using the source field in target methods deliberately: under JMH rules, constant optimizations for the source field are inhibited, preventing the runtime from caching the exception. In practice, the runtime would not perform this caching today, even without the source argument used as above, but in benchmarking, you’re better safe than sorry: if you can explicitly prevent a runtime optimization that interferes with the phenomenon you seek to measure, do so.
- The returned values are always returned from methods annotated with @GenerateMicroBenchmark, which inhibits the JVM’s dead-code elimination optimizations. This is how it looks in the code:
@GenerateMicroBenchmark
public int plain() {
    return doSomething();
}
- There are additional tests which use either LilException.metadata or the stack trace (see the sketch after this list).
- There are even more tests doing the same thing, but using the special -XX:-StackTraceInThrowable JVM option, which skips the filling in of stack trace information. Again, this is not considered good practice in production use, but it is important as an additional control. Some developers try to get the same behavior narrowed down to particular exceptions by overriding fillInStackTrace(); we will try that later.
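For illustration, here is a sketch of what the *_UsedData and *_UsedStack variants might look like; the actual benchmark code is linked above, and these exact shapes are my reconstruction:

@GenerateMicroBenchmark
public int dynamicException_UsedData() {
    try {
        return doSomething_Exception();
    } catch (LilException e) {
        // consume the metadata stored at instantiation time
        return e.getMetadata();
    }
}

@GenerateMicroBenchmark
public int dynamicException_UsedStack() {
    try {
        return doSomething_Exception();
    } catch (LilException e) {
        // force the VM stack trace to be materialized into Java objects
        return e.getStackTrace().length;
    }
}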
Results
Running the described test on my i5 (down-clocked to 2.0 GHz), Linux/x86_64, JDK 7u40 will yield:
Benchmark Mode Samples Mean Mean error Units
BasicBench.dynamicException avgt 25 1901.196 14.572 ns/op
BasicBench.dynamicException_NoStack avgt 25 67.029 0.212 ns/op
BasicBench.dynamicException_NoStack_UsedData avgt 25 68.952 0.441 ns/op
BasicBench.dynamicException_NoStack_UsedStack avgt 25 137.329 1.039 ns/op
BasicBench.dynamicException_UsedData avgt 25 1900.770 9.359 ns/op
BasicBench.dynamicException_UsedStack avgt 25 20033.658 118.600 ns/op
BasicBench.plain avgt 25 1.259 0.002 ns/op
BasicBench.staticException avgt 25 1.510 0.001 ns/op
BasicBench.staticException_NoStack avgt 25 1.514 0.003 ns/op
BasicBench.staticException_NoStack_UsedData avgt 25 4.185 0.015 ns/op
BasicBench.staticException_NoStack_UsedStack avgt 25 19.110 0.051 ns/op
BasicBench.staticException_UsedData avgt 25 4.159 0.007 ns/op
BasicBench.staticException_UsedStack avgt 25 25.144 0.186 ns/op
Ok, what do we see here? I’m playing Captain Hindsight a bit, for otherwise this post would be the chronicle of a two-day back-and-forth journey.
- First, and foremost, observe that the baseline plain test executes in just over 1 ns, which is the physical limit for my hardware to add up a long, read the boolean from memory, and do a branch. (It is remarkable how many benchmark harnesses fail or refuse to run this basic test.)
- The staticException_* tests are almost as fast as plain. This is because the target methods were inlined, and we do not do any additional heavy-weight work for exceptions. The exceptions are reduced to just control-flow effects.
- The dynamicException_* tests are not as happy. We can see that the process of raising the exception is rather costly, varying over several orders of magnitude beyond the cost of plain.
- Observe the difference between dynamicException and dynamicException_NoStack. These dramatically illustrate the cost of creating the VM view of the stack trace.
- The contrast between dynamicException and dynamicException_UsedStack further dramatizes these costs: if we then request the stack trace, we force the runtime to translate the VM stack trace into the Java stack trace, which costs a lot the first time around.
- The translated Java stack trace is cached, as you can see from staticException_UsedStack, which did the first lazy conversion during the warmup.
- Note that the *_UsedStack tests still have to defensively copy out the StackTraceElement[] array, which explains the minute difference between staticException_UsedStack and staticException_NoStack_UsedStack. In the latter case, the array is shorter, because there is hardly any stack trace info to copy out.
- The same is true for the difference between dynamicException_UsedStack and dynamicException_NoStack_UsedStack. No stack trace, no performance problem!
- The *_UsedData tests showcase what happens when we use the exception to pass metadata around. In this case, exception objects are required to hold the metadata, although in this particular benchmark the exception instances can potentially be scalarized (in my runs, they really did not get scalarized, because we skip inlining of exception methods, and escape analysis breaks).
At this point, some people might stop digging and flee, thinking exceptions are so damn costly they should not be used anywhere. I especially loathe it when people bring up performance arguments in the API flamewars about exceptions. But we are not like those people. We understand there should be conditions under which the feature is actually nicely applicable. Before we move on and set up a good experiment, we need to further understand how these beasts behave. This will help us devise better benchmarks.
Lil' Exception is Born
Rationale
Exceptions capture the stack trace at the instantiation point. This feature is so damn useful that I often find myself debugging with exceptions to see the backtraces leading to some chunk of code under test. Note that you don’t even have to throw the exception to make it useful:
new Exception().printStackTrace();
In the above, the stack trace is filled in when the exception is instantiated, so the cost of instantiating the exception should depend on how deep we are in the call stack. Here is a simple benchmark for a case like that. The benchmark relies on a simple template method, which is then copy-pasted to specialize for XXXX. This is a moderately ugly way to do a quick benchmark; we’ll see how to do better later. Here is our benchmark code so far:
Integer source = 42;

@GenerateMicroBenchmark
public Object exception_XXXX() {
    return call(XXXX, true);
}

@GenerateMicroBenchmark
public Object returned_XXXX() {
    return call(XXXX, false);
}

public Object call(int depth, boolean except) {
    if (depth == 0) {
        if (except) {
            return new LilException(source);
        } else {
            return source;
        }
    } else {
        return call(depth - 1, except);
    }
}
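After the copy-paste specialization, each concrete depth gets its own pair of methods; for example, for depth 16 (method names reconstructed from the results below):

@GenerateMicroBenchmark
public Object exception_0016() {
    return call(16, true);
}

@GenerateMicroBenchmark
public Object returned_0016() {
    return call(16, false);
}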
Note several things about this benchmark:
- All inputs are always dependent on source, which breaks constant optimizations as before.
- All outputs are sunk back into the benchmark methods to avoid dead-code elimination. Doing so required us to use the most generic Object, because we need both LilException and Integer as the possible targets. That also required us to bump source from int to Integer, to pay the upfront cost of auto-boxing before the benchmark.
- There is the returned_* benchmark for the non-exceptional cases, to quantify the effect of calling multiple methods in a row; if that effect is on the same scale as the effect we are after, the benchmark is moot.
- A non-lazy person would further specialize call() for the exceptional and non-exceptional cases, but I opted instead to review the final generated code.
- A more cautious person might be inclined to obsess about tail-call optimizations here. TCO is not yet implemented in our JVM, and I checked in the generated assembly that it did not happen.
- An even more cautious person will obsess about inlining; in fact, recursive inlining in current HotSpot versions is effectively banned, and I verified that call() is not inlined. This is actually a good thing for this experiment, because we don’t have to account for inlining effects, as we will have to later.
Results
Running this on my i5 (down-clocked to 2.0 GHz), Linux/x86_64, JDK 7u40 will yield:
Benchmark Mode Samples Mean Mean error Units
StackTraceConstructBench.exception_0000 avgt 25 1959.068 30.783 ns/op
StackTraceConstructBench.exception_0001 avgt 25 1945.958 12.104 ns/op
StackTraceConstructBench.exception_0002 avgt 25 2063.575 47.708 ns/op
StackTraceConstructBench.exception_0004 avgt 25 2211.882 29.417 ns/op
StackTraceConstructBench.exception_0008 avgt 25 2472.729 57.336 ns/op
StackTraceConstructBench.exception_0016 avgt 25 2950.847 29.863 ns/op
StackTraceConstructBench.exception_0032 avgt 25 4416.548 50.340 ns/op
StackTraceConstructBench.exception_0064 avgt 25 6845.140 40.114 ns/op
StackTraceConstructBench.exception_0128 avgt 25 11774.758 54.299 ns/op
StackTraceConstructBench.exception_0256 avgt 25 21617.526 101.379 ns/op
StackTraceConstructBench.exception_0512 avgt 25 42780.434 144.594 ns/op
StackTraceConstructBench.exception_1024 avgt 25 82839.358 291.434 ns/op
StackTraceConstructBench.returned_0000 avgt 25 4.551 0.005 ns/op
StackTraceConstructBench.returned_0001 avgt 25 4.552 0.006 ns/op
StackTraceConstructBench.returned_0002 avgt 25 7.079 0.019 ns/op
StackTraceConstructBench.returned_0004 avgt 25 10.113 0.024 ns/op
StackTraceConstructBench.returned_0008 avgt 25 16.185 0.068 ns/op
StackTraceConstructBench.returned_0016 avgt 25 28.891 0.048 ns/op
StackTraceConstructBench.returned_0032 avgt 25 53.466 0.131 ns/op
StackTraceConstructBench.returned_0064 avgt 25 119.373 0.270 ns/op
StackTraceConstructBench.returned_0128 avgt 25 221.252 0.485 ns/op
StackTraceConstructBench.returned_0256 avgt 25 433.791 1.337 ns/op
StackTraceConstructBench.returned_0512 avgt 25 840.277 8.315 ns/op
StackTraceConstructBench.returned_1024 avgt 25 1686.329 5.122 ns/op
Hence, our hypothesis is validated: the deeper the call stack, the higher the cost of exception instantiation. It is actually quite scary to waste 80 microseconds to instantiate an exception buried down in 1024 calls. Hi there, Spring! Hi there, appservers! Let us bisect that effect a bit by running with -XX:-StackTraceInThrowable:
Benchmark Mode Samples Mean Mean error Units
StackTraceConstructBench.exception_0000 avgt 25 69.262 0.301 ns/op
StackTraceConstructBench.exception_0001 avgt 25 69.271 0.377 ns/op
StackTraceConstructBench.exception_0002 avgt 25 72.484 0.533 ns/op
StackTraceConstructBench.exception_0004 avgt 25 75.964 0.630 ns/op
StackTraceConstructBench.exception_0008 avgt 25 84.303 0.868 ns/op
StackTraceConstructBench.exception_0016 avgt 25 99.583 0.746 ns/op
StackTraceConstructBench.exception_0032 avgt 25 183.691 2.354 ns/op
StackTraceConstructBench.exception_0064 avgt 25 296.708 3.317 ns/op
StackTraceConstructBench.exception_0128 avgt 25 536.303 8.986 ns/op
StackTraceConstructBench.exception_0256 avgt 25 990.993 3.446 ns/op
StackTraceConstructBench.exception_0512 avgt 25 1928.261 5.526 ns/op
StackTraceConstructBench.exception_1024 avgt 25 6085.115 13.070 ns/op
StackTraceConstructBench.returned_0000 avgt 25 4.553 0.010 ns/op
StackTraceConstructBench.returned_0001 avgt 25 4.556 0.010 ns/op
StackTraceConstructBench.returned_0002 avgt 25 7.088 0.046 ns/op
StackTraceConstructBench.returned_0004 avgt 25 10.305 0.286 ns/op
StackTraceConstructBench.returned_0008 avgt 25 16.211 0.061 ns/op
StackTraceConstructBench.returned_0016 avgt 25 28.905 0.085 ns/op
StackTraceConstructBench.returned_0032 avgt 25 53.426 0.262 ns/op
StackTraceConstructBench.returned_0064 avgt 25 119.256 0.299 ns/op
StackTraceConstructBench.returned_0128 avgt 25 221.135 0.652 ns/op
StackTraceConstructBench.returned_0256 avgt 25 433.753 0.805 ns/op
StackTraceConstructBench.returned_0512 avgt 25 835.677 0.917 ns/op
StackTraceConstructBench.returned_1024 avgt 25 1685.756 4.288 ns/op
Thus, we conclude that the dominating performance effect is the filling in of the stack trace. There is a second-order effect on par with the cost of calling the method chain itself, but we will leave that for the future. (It is actually gruesome, because it requires heavy digging into all of the special cases the VM has for exception handling.)
Another question is: does the cost depend on the depth of the Java call stack, or the native call stack? That is, do we spend more time populating the Java view of the call stack, or do we actually wander around compiled code call stacks to figure out what’s where? We can check that with more aggressive inlining, but that will also require us to unroll the call chain in a new benchmark, to get rid of the recursive call. See, we were lazy before, and now we end up with two benchmarks instead of one.
Running with -XX:-Inline will make the Java stack look roughly the same as the native stack:
Benchmark Mode Samples Mean Mean error Units
StackTracePlainBench.exception_0000 avgt 25 2463.298 58.179 ns/op
StackTracePlainBench.exception_0001 avgt 25 2588.883 44.302 ns/op
StackTracePlainBench.exception_0002 avgt 25 2671.243 62.618 ns/op
StackTracePlainBench.exception_0004 avgt 25 2814.950 16.428 ns/op
StackTracePlainBench.exception_0008 avgt 25 3279.462 56.387 ns/op
StackTracePlainBench.exception_0016 avgt 25 4254.165 96.825 ns/op
StackTracePlainBench.exception_0032 avgt 25 6448.725 25.428 ns/op
StackTracePlainBench.exception_0064 avgt 25 10604.247 27.803 ns/op
StackTracePlainBench.exception_0128 avgt 25 18937.581 220.683 ns/op
StackTracePlainBench.exception_0256 avgt 25 38333.533 382.913 ns/op
StackTracePlainBench.exception_0512 avgt 25 80801.420 854.231 ns/op
StackTracePlainBench.returned_0000 avgt 25 12.449 0.035 ns/op
StackTracePlainBench.returned_0001 avgt 25 14.893 0.054 ns/op
StackTracePlainBench.returned_0002 avgt 25 18.096 1.068 ns/op
StackTracePlainBench.returned_0004 avgt 25 23.046 0.068 ns/op
StackTracePlainBench.returned_0008 avgt 25 34.278 0.544 ns/op
StackTracePlainBench.returned_0016 avgt 25 68.957 0.255 ns/op
StackTracePlainBench.returned_0032 avgt 25 316.878 6.055 ns/op
StackTracePlainBench.returned_0064 avgt 25 844.046 5.821 ns/op
StackTracePlainBench.returned_0128 avgt 25 1927.618 19.176 ns/op
StackTracePlainBench.returned_0256 avgt 25 5280.506 101.069 ns/op
StackTracePlainBench.returned_0512 avgt 25 14089.720 48.708 ns/op
This is similar to what we have seen with the StackTraceConstruct benchmark. Now that we have established this similarity, let us deeply inline the method chain with -XX:MaxInlineLevel=1500. This will collapse the native stack, while keeping the Java stack the same:
Benchmark Mode Samples Mean Mean error Units
StackTracePlainBench.exception_0000 avgt 25 1955.621 29.474 ns/op
StackTracePlainBench.exception_0001 avgt 25 1958.101 12.951 ns/op
StackTracePlainBench.exception_0002 avgt 25 1991.971 33.410 ns/op
StackTracePlainBench.exception_0004 avgt 25 2048.335 9.365 ns/op
StackTracePlainBench.exception_0008 avgt 25 2220.598 11.105 ns/op
StackTracePlainBench.exception_0016 avgt 25 2490.036 26.720 ns/op
StackTracePlainBench.exception_0032 avgt 25 3316.666 30.743 ns/op
StackTracePlainBench.exception_0064 avgt 25 4666.447 22.221 ns/op
StackTracePlainBench.exception_0128 avgt 25 7380.248 38.103 ns/op
StackTracePlainBench.exception_0256 avgt 25 13456.551 75.191 ns/op
StackTracePlainBench.exception_0512 avgt 25 34542.346 884.181 ns/op
StackTracePlainBench.returned_0000 avgt 25 4.547 0.006 ns/op
StackTracePlainBench.returned_0001 avgt 25 4.553 0.017 ns/op
StackTracePlainBench.returned_0002 avgt 25 4.546 0.003 ns/op
StackTracePlainBench.returned_0004 avgt 25 4.548 0.006 ns/op
StackTracePlainBench.returned_0008 avgt 25 4.546 0.006 ns/op
StackTracePlainBench.returned_0016 avgt 25 4.551 0.023 ns/op
StackTracePlainBench.returned_0032 avgt 25 4.547 0.004 ns/op
StackTracePlainBench.returned_0064 avgt 25 4.545 0.003 ns/op
StackTracePlainBench.returned_0128 avgt 25 4.544 0.002 ns/op
StackTracePlainBench.returned_0256 avgt 25 4.545 0.002 ns/op
StackTracePlainBench.returned_0512 avgt 25 170.192 59.456 ns/op
Note how the returned_* tests collapsed almost to a single return (the last one hits the internal inlining limits, though).

You can clearly see the call stack dependence is not gone even with the deepest inlining, which suggests the cost is proportional to the depth of the Java stack, not the native one. This additionally validates the hypothesis that the major cost is the stack trace construction.
Lil' Exception is Thrown
Rationale
Here is where it gets seriously more complicated. After the exception is created, it is usually thrown. Throwing the exception in theory requires a non-local control transfer, because the exception handler can be somewhere up the call stack. In practice, however, the target handler can be right in the same compilation unit, especially after inlining. But before we go there, let us do some basic tests.
In order to focus on the throwing costs, we will use the static exception, thereby paying the costs of creating the stack trace upfront. With that, we can make a few throwing tests. In hindsight, we do two tests: one with the default settings, and another with inlining completely turned off (-XX:-Inline).
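Reconstructed from the disassembly shown below, the benchmark targets look roughly like this (the s1() name matches the generated code; the precise shape, including the inlining directive, is my assumption):

@GenerateMicroBenchmark
public int exception_inline() {
    try {
        return s1();
    } catch (LilException e) {
        // pick the result out of the caught exception
        return e.getMetadata();
    }
}

@CompilerControl(CompilerControl.Mode.INLINE)
public int s1() throws LilException {
    throw staticException;
}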
Results
Running this on my i5 (down-clocked to 2.0 GHz), Linux/x86_64, JDK 7u40 will yield:
Benchmark Mode Samples Mean Mean error Units
ThrowingBench.exception_inline avgt 25 4.166 0.008 ns/op
ThrowingBench.exception_noInline avgt 25 248.034 1.160 ns/op
ThrowingBench.plain_inline avgt 25 1.265 0.007 ns/op
ThrowingBench.plain_noInline avgt 25 6.478 0.009 ns/op
The costs of turning off inlining for plain methods are expected: a few nanoseconds here and there for two non-inlined calls on the hot path (the first one is the benchmark method itself, the second is its callee). Now what is going on with exception_noInline? It is actually educational at this point to dig into the assembly to see how exception handling works at a very basic level.
Consider the generated assembly for exception_inline first (I rearranged it and added the comments, purged the non-essential addresses, etc.):
; {method} 'exception_inline' '()I' in 'net/shipilev/perf/exceptions/ThrowingBench'
; (...skipped...)
; Verified Entry Point
mov %eax,-0x14000(%rsp)
push %rbp
sub $0x10,%rsp
;*getfield staticException
mov 0x10(%rsi),%r11d
;*getfield metadata
; implicit exception: dispatches to $HANDLER
mov 0x20(%r12,%r11,8),%eax
add $0x10,%rsp
pop %rbp
; safepoint poll, and return!
test %eax,0x9e40280(%rip)
retq
; exception handler
HANDLER:
mov $0xfffffff6,%esi
nop
; calling runtime to raise the exception
callq 0x00007f76047eb320 ; OopMap{off=76}
callq 0x00007f760d3eeb10 ;*athrow
; do not expect to return, halt otherwise
hlt
What do we see here? Well, there is an exception handler lurking. However, this is not our exception handler; it is the implicit null check for the staticException field. The common path loads staticException.metadata into %eax and almost immediately returns. Recall that the x86_64 calling convention puts the returned value into %eax. Note that even though there is a try { } catch { } section in the original Java code, as far as the compiler is concerned, this is just a form of control flow which can be optimized into the most straightforward code. This part is easy, because the target for the control transfer is right here, since s1() is inlined.
Let’s break that, and see what the generated assembly for exception_noInline looks like.
; {method} 'exception_noInline' '()I' in 'net/shipilev/perf/exceptions/ThrowingBench'
; (...skipped...)
; Verified Entry Point
mov %eax,-0x14000(%rsp)
push %rbp
sub $0x10,%rsp
; calling s1()
xchg %ax,%ax
callq 0x00007f5c981a7b60 ; OopMap{off=52}
HERE_IS_YOUR_VALUE_SIR:
add $0x10,%rsp
pop %rbp
; poll safepoint and return!
test %eax,0x9f136e1(%rip)
retq
; EXCEPTION HANDLER
; %eax is exception; typecheck for LilException
mov %rax,%r10
mov 0x8(%rax),%r11d
cmp $0xe0449fb9,%r11d
jne TYPE_CHECK_FAILED
; *getfield metadata
mov 0x20(%r10),%eax
jmp HERE_IS_YOUR_VALUE_SIR
; type-check failed, unhandled exception, drop the frame
; and get to VM crying the tears of evil, hoping somebody
; else will handle
TYPE_CHECK_FAILED:
mov %rax,%rsi
add $0x10,%rsp
pop %rbp
jmpq 0x00007f5c981d23a0
; expecting no return
hlt
What do we see now? As expected, the call to s1() is not inlined. But we are expected to receive the exception from there, right? Also, we must get the metadata out of the exception and return it. Here is how it works: the exception handler is right there in the method, its code generated just after the final ret. It is the VM exception handler’s duty to pass control here if somebody past the s1() call throws the exception back to us. If such a throw occurs, we read out metadata into %eax, jump to right after the call pretending nothing happened, and return as usual, concluding the receiver part of the story.
And here is the shiny s1() itself:
; {method} 's1' '()I' in 'net/shipilev/perf/exceptions/ThrowingBench'
; (..skipped...)
; Verified Entry Point
mov %eax,-0x14000(%rsp)
push %rbp
sub $0x10,%rsp
; getfield staticException
mov 0x10(%rsi),%r11d
; check for null and pass on
test %r11d,%r11d
je HOLY_CRAP_THIS_CANT_BE_HAPPENING
; everything is fine.
; unpack the compressed pointer to the exception?
lea (%r12,%r11,8),%rsi
; leave the frame and transfer to VM
add $0x10,%rsp
pop %rbp
jmpq 0x00007f5c981d23a0
; the exception is null, call around screaming and throwing NPEs
HOLY_CRAP_THIS_CANT_BE_HAPPENING:
mov $0xfffffff6,%esi
xchg %ax,%ax
callq 0x00007f5c981a9320
callq 0x00007f5ca0e84b10
; expect no return
hlt
So, we read the exception field, null-check it, and pass it on to the VM handler. There is an additional dance should the field be null, but we have seen that before. I will not go into the details of how the VM handler operates at this time.
Lil' Exception is Framed
Rationale
The previous experiment highlighted how try {} catch {} is treated in terms of control-flow constructs, and how it can get optimized quite reliably. Let’s assess this a bit more rigorously.
We do a very basic benchmark just to make sure these constructs do not yield any surprising behaviors while we are at it. Of course, with enough try/catch/finally blocks you can have a very convoluted control flow, but you can have the same with lots of branches!
The highlights of the benchmark:
- We check whether declaring checked/unchecked exceptions makes any difference.
- Our test methods never throw, although the compiler does not know that, because source is unpredictable.
- We always return the result, whether it was produced normally, exceptionally, or through hijacking finally. (A sketch of one variant follows this list.)
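A minimal sketch of one such variant, assuming this general shape (the guard on source is hypothetical; the real code is in the linked benchmark):

@GenerateMicroBenchmark
public int checked_tryCatch() {
    try {
        return doSomething_Checked();
    } catch (LilException e) {
        return e.getMetadata();
    }
}

@CompilerControl(CompilerControl.Mode.INLINE)
public int doSomething_Checked() throws LilException {
    // never taken in practice, but the compiler cannot prove that,
    // because source is unpredictable
    if (source == Integer.MIN_VALUE) {
        throw new LilException(source);
    }
    return source;
}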
Results
Running this on my i5 (down-clocked to 2.0 GHz), Linux/x86_64, JDK 7u40 will yield:
Benchmark Mode Samples Mean Mean error Units
TryCatchBench.checked_tryCatch avgt 25 1.515 0.004 ns/op
TryCatchBench.checked_tryCatchFinally avgt 25 1.513 0.003 ns/op
TryCatchBench.checked_tryCatch_FinallyEmpty avgt 25 1.517 0.004 ns/op
TryCatchBench.no_plain avgt 25 1.514 0.003 ns/op
TryCatchBench.no_tryCatch avgt 25 1.517 0.003 ns/op
TryCatchBench.no_tryCatchFinally avgt 25 1.517 0.004 ns/op
TryCatchBench.no_tryCatch_FinallyEmpty avgt 25 1.516 0.002 ns/op
TryCatchBench.unchecked_plain avgt 25 1.518 0.005 ns/op
TryCatchBench.unchecked_tryCatch avgt 25 1.516 0.002 ns/op
TryCatchBench.unchecked_tryCatchFinally avgt 25 1.514 0.003 ns/op
TryCatchBench.unchecked_tryCatch_FinallyEmpty avgt 25 1.517 0.003 ns/op
So, there is no difference, no matter how convoluted the local control flow is. Let’s make it a bit harder with -XX:-Inline:
Benchmark Mode Samples Mean Mean error Units
TryCatchBench.checked_tryCatch avgt 25 9.447 0.051 ns/op
TryCatchBench.checked_tryCatchFinally avgt 25 9.431 0.070 ns/op
TryCatchBench.checked_tryCatch_FinallyEmpty avgt 25 9.423 0.039 ns/op
TryCatchBench.no_plain avgt 25 9.326 0.026 ns/op
TryCatchBench.no_tryCatch avgt 25 9.675 0.059 ns/op
TryCatchBench.no_tryCatchFinally avgt 25 9.622 0.036 ns/op
TryCatchBench.no_tryCatch_FinallyEmpty avgt 25 9.663 0.114 ns/op
TryCatchBench.unchecked_plain avgt 25 9.235 0.027 ns/op
TryCatchBench.unchecked_tryCatch avgt 25 9.377 0.027 ns/op
TryCatchBench.unchecked_tryCatchFinally avgt 25 9.391 0.034 ns/op
TryCatchBench.unchecked_tryCatch_FinallyEmpty avgt 25 9.372 0.019 ns/op
Now, there is a little difference for the no_* family of tests, but the difference in question is explainable by minute differences in the generated code after inlining.
Nothing to see here, move along. Quick tests like these catch the surprises early. This is especially important to do when you are new to the domain, and have not yet learned all the landmines in the field. If you haven’t yet blown yourself up, that does not mean you should stop looking.
Lil' Exception is Pushed Around
Rationale
As we saw in our basic throwing tests, should we actually require the non-local control transfer, we invoke the VM handler to look for the appropriate exception handler. Let’s quantify this a bit more. You might notice that we pop the frame right before we leave the callee which produced the unhandled exception, before transferring control to the VM stub. We actually have to transfer control to the caller’s exception handlers, and if our caller also cannot handle the exception, then we have to pop its frame as well, and continue looking at the caller’s caller, etc., before we find the matching exception handler. This process is often called stack unwinding.
The important thing to keep in mind here is: while the stack traces contain the "frames" as written out in the original program, the runtime can have a different perspective on things. We will show this with a targeted benchmark. Notice several things in that benchmark:
- We use the static exception to exclude the stack trace costs.
- We construct a long chain of methods, with a leaf method which throws (see the sketch after this list).
- We measure this in different modes: with default inlining, with inlining turned off (-XX:-Inline), and with gradually tuned inlining thresholds (via -XX:MaxInlineLevel=#).
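Here is a sketch of that chain, with method names matching those discussed in the results below (middle levels elided; see the linked benchmark for the complete code):

public int callLongCat() {
    try {
        return l00();
    } catch (LilException e) {
        return e.getMetadata();
    }
}

public int l00() throws LilException { return l01(); }
public int l01() throws LilException { return l02(); }
// ... l02() through l18() follow the same pattern ...
public int l19() throws LilException { throw staticException; }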
Results
Running this on my i5 (down-clocked to 2.0 GHz), Linux/x86_64, JDK 7u40 will yield:
Benchmark Mode Samples Mean Mean error Units
StackUnwindingBench.long_exception_inline avgt 25 234.947 3.714 ns/op
StackUnwindingBench.long_exception_noInline avgt 25 4647.198 28.181 ns/op
StackUnwindingBench.long_exception_inline_00 avgt 25 2188.245 19.086 ns/op
StackUnwindingBench.long_exception_inline_01 avgt 25 1707.567 12.551 ns/op
StackUnwindingBench.long_exception_inline_02 avgt 25 1229.186 5.186 ns/op
StackUnwindingBench.long_exception_inline_03 avgt 25 975.936 5.483 ns/op
StackUnwindingBench.long_exception_inline_04 avgt 25 735.410 4.403 ns/op
StackUnwindingBench.long_exception_inline_05 avgt 25 746.726 6.015 ns/op
StackUnwindingBench.long_exception_inline_06 avgt 25 497.924 7.339 ns/op
StackUnwindingBench.long_exception_inline_07 avgt 25 491.823 4.129 ns/op
StackUnwindingBench.long_exception_inline_08 avgt 25 505.133 7.140 ns/op
StackUnwindingBench.long_exception_inline_09 avgt 25 230.584 2.700 ns/op
StackUnwindingBench.long_exception_inline_10 avgt 25 231.575 2.396 ns/op
StackUnwindingBench.long_exception_inline_11 avgt 25 233.448 1.977 ns/op
StackUnwindingBench.long_exception_inline_12 avgt 25 232.751 2.389 ns/op
StackUnwindingBench.long_exception_inline_13 avgt 25 232.258 2.134 ns/op
StackUnwindingBench.long_exception_inline_14 avgt 25 232.807 4.310 ns/op
StackUnwindingBench.long_exception_inline_15 avgt 25 230.926 1.247 ns/op
StackUnwindingBench.long_exception_inline_16 avgt 25 232.417 3.181 ns/op
StackUnwindingBench.long_exception_inline_17 avgt 25 231.465 3.616 ns/op
StackUnwindingBench.long_exception_inline_18 avgt 25 230.850 3.158 ns/op
StackUnwindingBench.long_exception_inline_19 avgt 25 232.098 2.809 ns/op
StackUnwindingBench.long_exception_inline_20 avgt 25 4.297 0.154 ns/op
StackUnwindingBench.long_exception_inline_30 avgt 25 4.307 0.152 ns/op
Notice several things:
- With the deepest inlining at level 20 and beyond, the stack unwinding is essentially stopped, since we inline the entire method chain down to the throwing leaf method.
- Once we have at least one exception handler transfer, we lose around 230 ns. Recall our previous tests where we called a single throwing method: these are the same 230 ns.
- As we lower the inlining threshold, we effectively break the call chain into smaller inlined chunks. You can use -XX:+PrintInlining to review this. Here’s a rough description of what is happening. At level 19, we have inlined everything from callLongCat() through l18(), with a lone leaf of l19(), thus having a single lookup, and 230 ns. Nothing changed until we dropped down to level 8, at which point we inlined everything from callLongCat() through l08(), then inlined everything from l09() through l18(), and kept a lone l19(), thus having two lookups, and 500 ns. This goes on and on as we lower the inlining threshold.
- The default *_inline test "enjoys" HotSpot’s default setting of -XX:MaxInlineLevel=9.
Lil' Exception Meets The Obnoxious Flags
Rationale
Armed with the experiments from Lil' Exception’s infancy, we can now conduct the ultimate adolescence test for Exceptions. This is the benchmark code. The highlights of the tests are as follows:
- We create a deep chain of 20 method calls, with the last method sometimes producing the exceptional condition.
- We deliver the exceptional return with the given ppm (parts-per-million) frequency. Since we need to juggle ppm in some non-trivial manner, we opt to use the JMH API, run all the benchmarks with the appropriate setting, consume their results, and print them in CSV.
- We deliver the exceptional return via several means (sketches of a few of these modes follow this list):
  - dynamic: throws the dynamic exception from the leaf, and catches it in the root
  - static: throws the static exception from the leaf, and catches it in the root
  - dynamicChain: chains the dynamic exceptions at every level. This follows the usual practice of wrapping exceptions to match an API, provide additional metadata, etc.
  - dynamicRethrow: rethrows the caught dynamic exception at every level. There is no syntactic reason to do this, but it potentially exhibits different handler lookup patterns, since the exception is almost always caught at the caller.
  - staticRethrow: ditto, but for static exceptions.
  - dynamicStackless: overrides fillInStackTrace(), producing the stackless exception. This is a questionable practice, but we will quantify it anyway.
  - flags: does not use exceptions at all, but uses int flags to mark the erroneous condition.
- All methods in the chain do the work on their inputs, in order to prohibit a quick fall-through of the leaf result straight to the root.
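To make a few of these modes concrete, here are minimal sketches; the class name and the m*/f* method names are hypothetical, while the chaining constructor is the one defined on LilException earlier:

// dynamicStackless: override fillInStackTrace() so the exception never
// captures the stack
public class LilStacklessException extends LilException {
    public LilStacklessException(int metadata) {
        super(metadata);
    }

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this; // skip the expensive stack walk entirely
    }
}

// dynamicChain: wrap and re-throw at every level (hypothetical mid-level method)
public int m05() throws LilException {
    try {
        return m06();
    } catch (LilException e) {
        throw new LilException(e, e.getMetadata());
    }
}

// flags: no exceptions at all; a sentinel int signals the error
// (ERROR is a hypothetical constant)
public int f05() {
    int v = f06();
    if (v == ERROR) {
        return ERROR; // propagate the flag manually, at the cost of a branch
    }
    return v + 1; // some work on the input, as in the real benchmark
}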
The flamewars around exceptions vs. flags often revolve around performance. Let’s see how exceptions totally devastate the flags performance!
Results
Deepest Inline
Since we know that inlining has a bearing on exception performance, let’s quantify the best-case scenario first. That is, plug -XX:MaxInlineLevel=1500 into the run. Running the test on my i5 (down-clocked to 2.0 GHz), Linux/x86_64, JDK 7u40 will yield this:
"WHOA", should have said the part of developers, and head out to cache exceptions. While they are busy with screwing up their code which I will then fix for money, we’ll zoom in at Y axis:
So, yeah, static exceptions are arguably good, and dynamicStackless is almost catching up. See how the other dynamic exceptions are only slower after some frequency. We will discuss what that means a bit later, but before that, we go for another estimate.
Shallowest Inline
The worst-case behavior is arguably when inlining is turned off completely with -XX:-Inline. Running the experiment like that will yield this chart:
Where’s your static exceptions now? Zooming in:
Well, that’s fairly erratic behavior, which is explained by the extra non-determinism as we compile all the methods individually. But the important thing to notice is that exceptions are either indistinguishable in the noise, or lose outright.
The Middle Ground
Of course, the two cases above are the extreme examples. For something more real-world, let’s not hijack the inlining in any way, but rather leave the VM alone. Re-running the test with default VM options yields:
Now hold on a minute, this is not really different from either case. Of course, the devil is in the details. If we zoom in again:
Now both dynamic and static exceptions are faster than flags at some low exceptional frequency.
The Explanation
What do we actually see here?
- flags indeed provides more or less consistent performance, except for the last part, when they start to mess with execution profiles and branch predictors.
- dynamic actually works better than flags for up to roughly a 10^-3 exception frequency in the default mode. The reason is that the non-exceptional path enjoys no additional overheads of checking for exceptions, since the exception handling code is laid out and handled elsewhere. Of course, as exceptions get more frequent, we pay for both the stack traces and the VM handling of the exceptions.
- static is even faster, and works better than flags for up to roughly a 10^-2 frequency in the default mode, since it does not pay the cost of constructing the stack traces. When deeply inlined, we do not even pay the cost of stack unwinding while looking for the appropriate exception handler!
- dynamicStackless is almost on par with static, because we do not construct the stack trace. However, it is still an object allocation, and we pay an additional cost for that.
That means, when exceptions are exceptional, they bring the best performance of all! If FP folks can preach that Monads are the sane way to construct software so as to lead the execution "through the happy path", then I can argue that Exceptions are the great feature that leads the generated code "through the happy fast path".
Conclusion
What did we learn from this experience?
- Truly exceptional exceptions are beautifully performant. If you use them as designed, and only communicate the truly exceptional cases among an overwhelmingly large number of non-exceptional cases handled by regular code, then using exceptions is a performance win.
- The performance costs of exceptions have two major components: stack trace construction when the Exception is instantiated, and stack unwinding while the Exception is thrown.
- Stack trace construction costs are proportional to the stack depth at the moment of exception instantiation. That is already bad, because who on Earth knows the stack depth at which a throwing method will be called? Even if you turn off stack trace generation and/or cache the exceptions, you can only get rid of this part of the performance cost.
- Stack unwinding costs depend on how lucky we are with bringing the exception handler closer in the compiled code. Carefully structuring the code to avoid deep exception handler lookups probably helps us get lucky more often.
- Should we eliminate both effects, the performance cost of exceptions is that of a local branch. No matter how beautiful that sounds, it does not mean you should use Exceptions as the usual control flow, because in that case you are at the mercy of the optimizing compiler! You should only use them in truly exceptional cases, where the exception frequency amortizes the possible unlucky cost of raising the actual exception.
- The optimistic rule of thumb seems to be that a 10^-4 exception frequency is exceptional enough. That, of course, depends on the heavy-weight-ness of the exceptions themselves, the exact actions taken in the exception handlers, etc.
Or, to put it differently:
Thank you, Captain Hindsight!