The Exceptional Performance of Lil' Exception

Aleksey Shipilёv, @shipilev, aleksey@shipilev.net

Note	This post is also available in ePUB and mobi.

Preface

The post you’re reading is loosely based on two Russian blog posts (1) (2) I wrote a few years ago. Since that time, the JMH microbenchmarking framework has been released, making it possible to render the previous discussion more rigorous and quantitative. Those posts began when someone had asked me about the performance of the try-catch construct, and the performance implications of Exceptions in general. In this English version we unify and extend the previous Russian posts.

We would like to thank Marcus Lagergren for quick sanity checks on the post, and Jerry Kuch for lots of editorial comments on my non-native English. Thanks guys!

Along the way, we will take some diversions into benchmarking methodology, effectively making this post a primer on quantifying the performance of language features in a concrete implementation. If you still haven’t learned about JMH and/or haven’t looked through the JMH samples, then I suggest you do that first before reading the rest of this post for the best experience.

Brace yourselves and good luck.

Lil' Exception is Conceived

Let us start with the exceptions themselves. This is the sample Exception we will be dealing with thorough the post:

package net.shipilev.perf.exceptions;

public class LilException extends Exception {
    private final int metadata;

    public LilException(int metadata) {
        super();
        this.metadata = metadata;
    }

    public LilException(LilException e, int metadata) {
        super(e);
        this.metadata = metadata;
    }

    public int getMetadata() {
        return metadata;
    }
}

Note that the above exception contains the LilException.metadata field, which vaguely simulates the sort of auxiliary information that might be placed in an Exception upon its creation. LilException.metadata is important for our benchmarking purposes, because hollow exceptions behave a little differently, as you will see below.

Lil' Exception’s Prospects of Life

Rationale

To get started, we should first do some very basic benchmarks demonstrating throwing and immediately catching the exceptions in a simple try-catch body. You can find the complete code for this benchmark here, but we will highlight the important points below:

First of all, let’s consider a few scenarios, namely: returning a value normally, throwing the dynamic exception, throwing the static exception. Note that all of our benchmark targets are marked with the @CompilerControl annotation to force inlining. If you took my earlier advice and looked around the JMH project, you may recall that @CompilerControl is a VM-specific mechanism for influencing how the JVM’s JIT compiler will treat our benchmark methods.

@CompilerControl(CompilerControl.Mode.INLINE)
public int doSomething() {
    return source;
}

@CompilerControl(CompilerControl.Mode.INLINE)
public int doSomething_Exception() throws LilException {
    throw new LilException(source);
}

@CompilerControl(CompilerControl.Mode.INLINE)
public int doSomething_Exception_Static() throws LilException {
    throw staticException;
}

Static exceptions are exception instances which are pre-cached in a field. In practice, such usages might be considered either a code smell or a performance optimization, depending on context, but for our current purposes they’re important as a control, providing an additional reference point against which to validate our thinking about exception performance:

@State(Scope.Thread)
public class BasicBench {

    LilException staticException;
    int source;

    @Setup(Level.Iteration)
    public void setup() {
        staticException = new LilException(source);
    }
...

We choose to instantiate LilException using the source field in target methods deliberately: under JMH rules, constant optimizations for the source field are inhibited, preventing the runtime from caching the exception. In practice, the runtime would not do perform this caching today, even without the source argument used as above, but in benchmarking, you’re better safe than sorry: if you can explicitly prevent a runtime optimization that interferes with the phenomenon you seek to measure, do so.
The returned values are always returned from methods annotated with @GenerateMicroBenchmark, which inhibits the JVM’s dead-code elimination optimizations. This is how it looks in the code:

@GenerateMicroBenchmark
public int plain() {
    return doSomething();
}

There are additional tests which use either LilException.metadata or the stack trace.
There are even more tests doing the same thing, but using the special -XX:-StackTraceInThrowable JVM option, which skips the filling in of stack trace information. Again, this is not considered a good practice in production use, but it is important as additional control. Some guys are trying to get the same behavior narrowed to particular exceptions by overriding fillInStackTrace(), we will try that later.

Results

Running the described test on my i5 (down-clocked to 2.0 Ghz), Linux/x86_64, JDK 7u40 will yield:

Benchmark                                       Mode   Samples         Mean   Mean error  Units

BasicBench.dynamicException                     avgt        25     1901.196       14.572  ns/op
BasicBench.dynamicException_NoStack             avgt        25       67.029        0.212  ns/op
BasicBench.dynamicException_NoStack_UsedData    avgt        25       68.952        0.441  ns/op
BasicBench.dynamicException_NoStack_UsedStack   avgt        25      137.329        1.039  ns/op
BasicBench.dynamicException_UsedData            avgt        25     1900.770        9.359  ns/op
BasicBench.dynamicException_UsedStack           avgt        25    20033.658      118.600  ns/op

BasicBench.plain                                avgt        25        1.259        0.002  ns/op
BasicBench.staticException                      avgt        25        1.510        0.001  ns/op
BasicBench.staticException_NoStack              avgt        25        1.514        0.003  ns/op
BasicBench.staticException_NoStack_UsedData     avgt        25        4.185        0.015  ns/op
BasicBench.staticException_NoStack_UsedStack    avgt        25       19.110        0.051  ns/op
BasicBench.staticException_UsedData             avgt        25        4.159        0.007  ns/op
BasicBench.staticException_UsedStack            avgt        25       25.144        0.186  ns/op

Ok, what do we see here? I’m playing Captain Hindsight a bit, for otherwise this post would be the chronicle of a two-day back-and-forth journey.

First, and foremost, observe that the baseline plain test executes in just over 1 ns, which is the physical limit for my hardware to add up a long, read the boolean from memory, and do a branch. (It is remarkable how many benchmark harnesses fail or refuse to run this basic test).
The staticException_* tests are almost as fast as plain. This is because the target methods had inlined, and we do not do any additional heavy-weight work for exceptions. The exceptions are reduced to just control-flow effects.
The dynamicException_* tests are not as happy. We can see that the process of raising the exception is rather costly, varying over several orders of magnitude beyond what the cost of plain.
Observe the difference between dynamicException and dynamicException_NoStack. These dramatically illustrate the cost of creating the VM view of the stack trace.
The contrast between dynamicException and dynamicException_UsedStack further dramatizes these costs: if we then request the stack trace, we force the runtime to translate the VM stack trace into the Java stack trace, which costs a lot the first time around.
The translated Java stack trace is cached, as you can see from staticException_UsedStack, which did the first lazy conversion during the warmup.
Note that the *_UsedStack tests still have to defensively copy out the StackTraceElement[] array, which explains minute differences between staticException_UsedStack and staticException_NoStack_UsedStack. In the latter case, the array is shorter, because there hardly any stack trace info to copy out.
The same is true for the difference between dynamicException_UsedStack and dynamicException_NoStack_UsedStack. No stack trace, no performance problem!
The *UsedData tests are showcasing what happens when we use the exception to pass metadata around. In this case, exception objects are required to hold the metadata, although in the case of this particular benchmark the exception instances can _potentially be scalarized (in my runs, they really didn’t, because we skip inlining of exception methods, and escape analysis breaks).

At this point, some people might stop digging, and flee, thinking exceptions are so damn costly, they should not be used anywhere. I especially loathe bringing up the performance arguments in the API flamewars about exceptions. But we are not like those people. We understand there should be conditions where the feature is actually nicely applicable. Before we move on and set up the good experiment, we need to further understand how these beasts behave. This will help us to devise better benchmarks.

Lil' Exception is Born

Rationale

Exceptions catch the stack trace at the instantiation point. This feature is so damn useful, that I often find myself debugging with exceptions to see the backtraces to some chunk of code under test. Note that you don’t even have to throw the exception to make it useful:

  new Exception().printStackTrace();

In the above, the stack trace is filled when exception is instantiated, so the cost of instantiating the exception should be dependent on how deep we are in the call stack. Here is the simple benchmark for the case like that. That benchmark relies on a simple template method, which is then copy-pasted to specialize for XXXX. This is a moderately ugly way to do a quick benchmark. We’ll see how to do better later. Here is our benchmark code, so far:

Integer source = 42;

@GenerateMicroBenchmark
public Object exception_XXXX() {
    return call(XXXX, true);
}

@GenerateMicroBenchmark
public Object returned_XXXX() {
    return call(XXXX, false);
}

public Object call(int depth, boolean except) {
    if (depth == 0) {
        if (except) {
            return new LilException(source);
        } else {
            return source;
        }
    } else {
        return call(depth - 1, except);
    }
}

Note several things about this benchmark:

All inputs are always dependent on source, which breaks constant optimizations as before.
All outputs are sinked back to avoid dead-code elimination. Doing so required us to use the most generic Object, because we need both LilException and Integer as the possible targets. That also required us to bump source from int to Integer to pay the upfront cost of auto-boxing before the benchmark.
There is the returned_* benchmark for the non-exceptional cases, to quantify the effect of calling multiple methods in a row; if that effect is on the same scale as the effect we are after, the benchmark is moot.
A non-lazy person would further specialize call for exceptional and non-exceptional cases, but I opted instead to review the final generated code.
A more cautious person might be inclined to obsess about tail-call optimizations here. TCO is not yet implemented in our JVM, and I checked in the generated assembly it did not happen.
An even more cautious person will obsess about inlining; actually, recursive inlining in current HotSpot versions is effectively banned, and I verified that call is not inlined. This is actually a good thing for this experiment, because we don’t have to count in the inlining things, as we would have to do later.

Results

Running this on my i5 (down-clocked to 2.0 Ghz), Linux/x86_64, JDK 7u40 will yield:

Benchmark                                 Mode   Samples         Mean   Mean error  Units

StackTraceConstructBench.exception_0000   avgt        25     1959.068       30.783  ns/op
StackTraceConstructBench.exception_0001   avgt        25     1945.958       12.104  ns/op
StackTraceConstructBench.exception_0002   avgt        25     2063.575       47.708  ns/op
StackTraceConstructBench.exception_0004   avgt        25     2211.882       29.417  ns/op
StackTraceConstructBench.exception_0008   avgt        25     2472.729       57.336  ns/op
StackTraceConstructBench.exception_0016   avgt        25     2950.847       29.863  ns/op
StackTraceConstructBench.exception_0032   avgt        25     4416.548       50.340  ns/op
StackTraceConstructBench.exception_0064   avgt        25     6845.140       40.114  ns/op
StackTraceConstructBench.exception_0128   avgt        25    11774.758       54.299  ns/op
StackTraceConstructBench.exception_0256   avgt        25    21617.526      101.379  ns/op
StackTraceConstructBench.exception_0512   avgt        25    42780.434      144.594  ns/op
StackTraceConstructBench.exception_1024   avgt        25    82839.358      291.434  ns/op

StackTraceConstructBench.returned_0000    avgt        25        4.551        0.005  ns/op
StackTraceConstructBench.returned_0001    avgt        25        4.552        0.006  ns/op
StackTraceConstructBench.returned_0002    avgt        25        7.079        0.019  ns/op
StackTraceConstructBench.returned_0004    avgt        25       10.113        0.024  ns/op
StackTraceConstructBench.returned_0008    avgt        25       16.185        0.068  ns/op
StackTraceConstructBench.returned_0016    avgt        25       28.891        0.048  ns/op
StackTraceConstructBench.returned_0032    avgt        25       53.466        0.131  ns/op
StackTraceConstructBench.returned_0064    avgt        25      119.373        0.270  ns/op
StackTraceConstructBench.returned_0128    avgt        25      221.252        0.485  ns/op
StackTraceConstructBench.returned_0256    avgt        25      433.791        1.337  ns/op
StackTraceConstructBench.returned_0512    avgt        25      840.277        8.315  ns/op
StackTraceConstructBench.returned_1024    avgt        25     1686.329        5.122  ns/op

Hence, our hypothesis is validated: the deeper the call stack, the higher the cost of exception instantiation. It is actually quite scary to waste 80 microseconds to instantiate the exception buried down in 1024 calls. Hi there, Spring! Hi there, appservers! Let us bisect that effect a bit by doing -XX:-StackTraceInThrowable:

Benchmark                                 Mode   Samples         Mean   Mean error  Units

StackTraceConstructBench.exception_0000   avgt        25       69.262        0.301  ns/op
StackTraceConstructBench.exception_0001   avgt        25       69.271        0.377  ns/op
StackTraceConstructBench.exception_0002   avgt        25       72.484        0.533  ns/op
StackTraceConstructBench.exception_0004   avgt        25       75.964        0.630  ns/op
StackTraceConstructBench.exception_0008   avgt        25       84.303        0.868  ns/op
StackTraceConstructBench.exception_0016   avgt        25       99.583        0.746  ns/op
StackTraceConstructBench.exception_0032   avgt        25      183.691        2.354  ns/op
StackTraceConstructBench.exception_0064   avgt        25      296.708        3.317  ns/op
StackTraceConstructBench.exception_0128   avgt        25      536.303        8.986  ns/op
StackTraceConstructBench.exception_0256   avgt        25      990.993        3.446  ns/op
StackTraceConstructBench.exception_0512   avgt        25     1928.261        5.526  ns/op
StackTraceConstructBench.exception_1024   avgt        25     6085.115       13.070  ns/op

StackTraceConstructBench.returned_0000    avgt        25        4.553        0.010  ns/op
StackTraceConstructBench.returned_0001    avgt        25        4.556        0.010  ns/op
StackTraceConstructBench.returned_0002    avgt        25        7.088        0.046  ns/op
StackTraceConstructBench.returned_0004    avgt        25       10.305        0.286  ns/op
StackTraceConstructBench.returned_0008    avgt        25       16.211        0.061  ns/op
StackTraceConstructBench.returned_0016    avgt        25       28.905        0.085  ns/op
StackTraceConstructBench.returned_0032    avgt        25       53.426        0.262  ns/op
StackTraceConstructBench.returned_0064    avgt        25      119.256        0.299  ns/op
StackTraceConstructBench.returned_0128    avgt        25      221.135        0.652  ns/op
StackTraceConstructBench.returned_0256    avgt        25      433.753        0.805  ns/op
StackTraceConstructBench.returned_0512    avgt        25      835.677        0.917  ns/op
StackTraceConstructBench.returned_1024    avgt        25     1685.756        4.288  ns/op

Thus, we conclude that the dominating performance effect is the filling in of the stack trace. There is a second-order effect on par with the cost of calling the method chain itself, but we will leave that for the future. (It is actually cruel, because it requires heavy digging in all of the special cases the VM has for exception handling).

Another question is, does the cost depend on the depth of Java call stack, or the native call stack? That is, do we spend more time populating the Java view of call stack, or we actually wander around compiled code call stacks to figure out what’s where? We can check that with more aggressive inlining, but that will also require us to unroll the the call chain in the new benchmark, to get rid of the recursive call. See, we were lazy before, and now we ended up with two benchmarks instead of one.

Running with -XX:-Inline will make Java stack look roughly the same as the native stack:

Benchmark                               Mode   Samples         Mean   Mean error    Units

StackTracePlainBench.exception_0000     avgt        25     2463.298       58.179    ns/op
StackTracePlainBench.exception_0001     avgt        25     2588.883       44.302    ns/op
StackTracePlainBench.exception_0002     avgt        25     2671.243       62.618    ns/op
StackTracePlainBench.exception_0004     avgt        25     2814.950       16.428    ns/op
StackTracePlainBench.exception_0008     avgt        25     3279.462       56.387    ns/op
StackTracePlainBench.exception_0016     avgt        25     4254.165       96.825    ns/op
StackTracePlainBench.exception_0032     avgt        25     6448.725       25.428    ns/op
StackTracePlainBench.exception_0064     avgt        25    10604.247       27.803    ns/op
StackTracePlainBench.exception_0128     avgt        25    18937.581      220.683    ns/op
StackTracePlainBench.exception_0256     avgt        25    38333.533      382.913    ns/op
StackTracePlainBench.exception_0512     avgt        25    80801.420      854.231    ns/op

StackTracePlainBench.returned_0000      avgt        25       12.449        0.035    ns/op
StackTracePlainBench.returned_0001      avgt        25       14.893        0.054    ns/op
StackTracePlainBench.returned_0002      avgt        25       18.096        1.068    ns/op
StackTracePlainBench.returned_0004      avgt        25       23.046        0.068    ns/op
StackTracePlainBench.returned_0008      avgt        25       34.278        0.544    ns/op
StackTracePlainBench.returned_0016      avgt        25       68.957        0.255    ns/op
StackTracePlainBench.returned_0032      avgt        25      316.878        6.055    ns/op
StackTracePlainBench.returned_0064      avgt        25      844.046        5.821    ns/op
StackTracePlainBench.returned_0128      avgt        25     1927.618       19.176    ns/op
StackTracePlainBench.returned_0256      avgt        25     5280.506      101.069    ns/op
StackTracePlainBench.returned_0512      avgt        25    14089.720       48.708    ns/op

This is similar to what we have seen with StackTraceConstruct benchmark. Now that we established this similarity, let us deeply inline the method chain with -XX:MaxInlineLevel=1500. This will collapse the native stack, while keeping the Java stack the same:

Benchmark                               Mode   Samples         Mean   Mean error    Units

StackTracePlainBench.exception_0000     avgt        25     1955.621       29.474    ns/op
StackTracePlainBench.exception_0001     avgt        25     1958.101       12.951    ns/op
StackTracePlainBench.exception_0002     avgt        25     1991.971       33.410    ns/op
StackTracePlainBench.exception_0004     avgt        25     2048.335        9.365    ns/op
StackTracePlainBench.exception_0008     avgt        25     2220.598       11.105    ns/op
StackTracePlainBench.exception_0016     avgt        25     2490.036       26.720    ns/op
StackTracePlainBench.exception_0032     avgt        25     3316.666       30.743    ns/op
StackTracePlainBench.exception_0064     avgt        25     4666.447       22.221    ns/op
StackTracePlainBench.exception_0128     avgt        25     7380.248       38.103    ns/op
StackTracePlainBench.exception_0256     avgt        25    13456.551       75.191    ns/op
StackTracePlainBench.exception_0512     avgt        25    34542.346      884.181    ns/op

StackTracePlainBench.returned_0000      avgt        25        4.547        0.006    ns/op
StackTracePlainBench.returned_0001      avgt        25        4.553        0.017    ns/op
StackTracePlainBench.returned_0002      avgt        25        4.546        0.003    ns/op
StackTracePlainBench.returned_0004      avgt        25        4.548        0.006    ns/op
StackTracePlainBench.returned_0008      avgt        25        4.546        0.006    ns/op
StackTracePlainBench.returned_0016      avgt        25        4.551        0.023    ns/op
StackTracePlainBench.returned_0032      avgt        25        4.547        0.004    ns/op
StackTracePlainBench.returned_0064      avgt        25        4.545        0.003    ns/op
StackTracePlainBench.returned_0128      avgt        25        4.544        0.002    ns/op
StackTracePlainBench.returned_0256      avgt        25        4.545        0.002    ns/op
StackTracePlainBench.returned_0512      avgt        25      170.192       59.456    ns/op

Note how returned_* tests collapsed almost to the single return (the latest one hits the internal inlining limits though).

You can clearly see the call stack dependence is not gone even with the deepest inlining, which suggests the cost is proportional to the depth of Java stack, not the native one. This additionally validates the hypothesis the major cost is the stack trace construction.

Lil' Exception is Thrown

Rationale

Here is where it gets seriously more complicated. After the exception is created, it is usually thrown. Throwing the exception in theory requires non-local control transfer, because the exception handler can be somewhere up the call stack. In practice, however, the target handler can be just in the same compilation unit, especially after inlining. But before we go there, let us do some basic tests.

In order to focus more on throwing costs, we will use the static exception, thereby paying upfront the costs of creating the stack trace. There, we can make a few throwing tests. In hindsight, we do two tests: one with the default settings, and another with inlining completely turned off (-XX:-Inline).

Results

Running this on my i5 (down-clocked to 2.0 Ghz), Linux/x86_64, JDK 7u40 will yield:

Benchmark                          Mode   Samples         Mean   Mean error  Units

ThrowingBench.exception_inline     avgt        25        4.166        0.008  ns/op
ThrowingBench.exception_noInline   avgt        25      248.034        1.160  ns/op
ThrowingBench.plain_inline         avgt        25        1.265        0.007  ns/op
ThrowingBench.plain_noInline       avgt        25        6.478        0.009  ns/op

The costs for turning off the inlining for plain methods are expected: a few nanoseconds here and there for two non-inlined calls on the hotpath (first one is the benchmark method itself, the second is its callee). Now what is going on with exception_noInline? It is actually educational at this point to dig into the assembly to see how exception handling works at a very basic level.

Consider the generated assembly for exception_inline first (I rearranged and added the comments, purged the non-essential addresses, etc):

; {method} 'exception_inline' '()I' in 'net/shipilev/perf/exceptions/ThrowingBench'
; (...skipped...)
; Verified Entry Point
  mov    %eax,-0x14000(%rsp)
  push   %rbp
  sub    $0x10,%rsp

;*getfield staticException
  mov    0x10(%rsi),%r11d

;*getfield metadata
; implicit exception: dispatches to $HANDLER
  mov    0x20(%r12,%r11,8),%eax

  add    $0x10,%rsp
  pop    %rbp

; safepoint poll, and return!
  test   %eax,0x9e40280(%rip)
  retq

; exception handler
HANDLER:
  mov    $0xfffffff6,%esi
  nop

; calling runtime to raise the exception
  callq  0x00007f76047eb320  ; OopMap{off=76}
  callq  0x00007f760d3eeb10  ;*athrow

; do not expect to return, halt otherwise
  hlt

What do we see here? Well, there is an exception handler lurking. However, this is not our exception handler, that is an implicit null check for staticException field. The common path loads staticException.metadata into %eax and almost immediately returns. Recall, that the x86_64 calling convention puts the returned value into the %eax. Note that even though there is the try { } catch { } section in the original Java code, as far as compiler is concerned, this is just the form of control flow which can be optimized to straightest-forward code. This part is easy, because the target for the control transfer is right here since s1() is inlined.

Let’s break that, and see what the generated assembly for exception_noInline looks like.

; {method} 'exception_noInline' '()I' in 'net/shipilev/perf/exceptions/ThrowingBench'
; (...skipped...)
; Verified Entry Point
  mov    %eax,-0x14000(%rsp)
  push   %rbp
  sub    $0x10,%rsp

; calling s1()
  xchg   %ax,%ax
  callq  0x00007f5c981a7b60  ; OopMap{off=52}

HERE_IS_YOUR_VALUE_SIR:
  add    $0x10,%rsp
  pop    %rbp

; poll safepoint and return!
  test   %eax,0x9f136e1(%rip)
  retq

; EXCEPTION HANDLER
; %eax is exception; typecheck for LilException
  mov    %rax,%r10
  mov    0x8(%rax),%r11d
  cmp    $0xe0449fb9,%r11d
  jne    TYPE_CHECK_FAILED

; *getfield metadata
  mov    0x20(%r10),%eax
  jmp    HERE_IS_YOUR_VALUE_SIR

; type-check failed, unhandled exception, drop the frame
; and get to VM crying the tears of evil, hoping somebody
; else will handle
TYPE_CHECK_FAILED:
  mov    %rax,%rsi
  add    $0x10,%rsp
  pop    %rbp
  jmpq   0x00007f5c981d23a0

; expecting no return
  hlt

What do we see now? As expected, the call to s1() is not inlined. But we are expected to receive the exception from there, right? Also, we must get the metadata out of the exception and return it. Here is how it works: the exception handler is right there in the method, its code generated just after the final ret. It is a VM exception handler duty to pass control here if somebody past the s1() call throws the exception back to us. If such a throw occurs, we read out metadata into %eax, jump right after the call pretending nothing happened, and return as usual, concluding the receiver part of the story.

And here is the shiny s1() itself:

; {method} 's1' '()I' in 'net/shipilev/perf/exceptions/ThrowingBench'
; (..skipped...)
; Verified Entry Point
  mov    %eax,-0x14000(%rsp)
  push   %rbp
  sub    $0x10,%rsp

; getfield staticException
  mov    0x10(%rsi),%r11d

; check for null and pass on
  test   %r11d,%r11d
  je     HOLY_CRAP_THIS_CANT_BE_HAPPENING

; everything is fine.
; unpack the compressed pointer to the exception?
  lea    (%r12,%r11,8),%rsi

; leave the frame and transfer to VM
  add    $0x10,%rsp
  pop    %rbp
  jmpq   0x00007f5c981d23a0

; the exception is null, call around screaming and throwing NPEs
HOLY_CRAP_THIS_CANT_BE_HAPPENING:
  mov    $0xfffffff6,%esi
  xchg   %ax,%ax
  callq  0x00007f5c981a9320
  callq  0x00007f5ca0e84b10

; expect no return
  hlt

So, we read the exception field, null-check it, and pass it on to VM handler. There is additional dance should the field be null, but we have seen it before. I will not go into the detail how the VM handler operates at this time.

Lil' Exception is Framed

Rationale

The previous experiment highlighted how try {} catch {} is treated in terms of control-flow constructs and how it can be get optimized quite reliably. Let’s assess it a bit more rigorously. We do a very basic benchmark just to make sure these constructs do not yield any surprising behaviors while we are at it. Of course, with enough try/catch/finally blocks you can have a very convoluted control flow, but you can have the same with lots of branches!

The highlights of the benchmark:

We check whether declaring checked/unchecked exceptions make any difference.
Our test methods never throw, although the compiler does not know that, because source is unpredictable.
We always return the result, whether it was produced normally, exceptionally, or through hijacking finally.

Results

Running this on my i5 (down-clocked to 2.0 Ghz), Linux/x86_64, JDK 7u40 will yield:

Benchmark                                       Mode   Samples         Mean   Mean error  Units
TryCatchBench.checked_tryCatch                  avgt        25        1.515        0.004  ns/op
TryCatchBench.checked_tryCatchFinally           avgt        25        1.513        0.003  ns/op
TryCatchBench.checked_tryCatch_FinallyEmpty     avgt        25        1.517        0.004  ns/op
TryCatchBench.no_plain                          avgt        25        1.514        0.003  ns/op
TryCatchBench.no_tryCatch                       avgt        25        1.517        0.003  ns/op
TryCatchBench.no_tryCatchFinally                avgt        25        1.517        0.004  ns/op
TryCatchBench.no_tryCatch_FinallyEmpty          avgt        25        1.516        0.002  ns/op
TryCatchBench.unchecked_plain                   avgt        25        1.518        0.005  ns/op
TryCatchBench.unchecked_tryCatch                avgt        25        1.516        0.002  ns/op
TryCatchBench.unchecked_tryCatchFinally         avgt        25        1.514        0.003  ns/op
TryCatchBench.unchecked_tryCatch_FinallyEmpty   avgt        25        1.517        0.003  ns/op

So, there is no difference, no matter how convoluted the local control flow is. Let’s make it a bit harder with -XX:-Inline:

Benchmark                                       Mode   Samples         Mean   Mean error  Units
TryCatchBench.checked_tryCatch                  avgt        25        9.447        0.051  ns/op
TryCatchBench.checked_tryCatchFinally           avgt        25        9.431        0.070  ns/op
TryCatchBench.checked_tryCatch_FinallyEmpty     avgt        25        9.423        0.039  ns/op
TryCatchBench.no_plain                          avgt        25        9.326        0.026  ns/op
TryCatchBench.no_tryCatch                       avgt        25        9.675        0.059  ns/op
TryCatchBench.no_tryCatchFinally                avgt        25        9.622        0.036  ns/op
TryCatchBench.no_tryCatch_FinallyEmpty          avgt        25        9.663        0.114  ns/op
TryCatchBench.unchecked_plain                   avgt        25        9.235        0.027  ns/op
TryCatchBench.unchecked_tryCatch                avgt        25        9.377        0.027  ns/op
TryCatchBench.unchecked_tryCatchFinally         avgt        25        9.391        0.034  ns/op
TryCatchBench.unchecked_tryCatch_FinallyEmpty   avgt        25        9.372        0.019  ns/op

Now, there is a little difference for no_* family of tests, but the difference in question is explainable by minute differences in generated code after inlining.

Nothing to see here, move along. The quick tests like these catch the surprises early. This is especially important to do when you are new to the domain, and have not yet learned all the landmines in the field. If you haven’t yet blown yourselves up, that does not mean you should stop looking.

Lil' Exception is Pushed Around

Rationale

As we saw in our basic throwing tests, should we actually require the non-local control transfer, we invoke the VM handler to look for the appropriate exception handler. Let’s quantify this a bit more. You might notice that we pop the frame right before we leave the callee which produced the unhandled exception, before transferring the control to the VM stub. We actually have to transfer the control to the caller method exception handlers, and if our caller also cannot handle the exception, then we have to pop its frame as well, and continue looking at the caller’s caller, etc. before we find that matching exception handler. This process is often called stack unwinding.

The important thing to keep in mind here is: while the stack traces contain the "frames" as written out in the original program, the runtime can have different perspective on things. We will show this with a targeted benchmark. Notice several things in that benchmark:

We take static exception to drop out stack trace costs.
We construct a long chain of methods, and leaf method which throws.
We measure this in different modes: with default inlining, with inlining turned off (-XX:-Inline), and gradually tuning the inlining thresholds (with -XX:MaxInlineLevel=#)

Results

Running this on my i5 (down-clocked to 2.0 Ghz), Linux/x86_64, JDK 7u40 will yield:

Benchmark                                      Mode   Samples         Mean   Mean error  Units

StackUnwindingBench.long_exception_inline      avgt        25      234.947        3.714  ns/op
StackUnwindingBench.long_exception_noInline    avgt        25     4647.198       28.181  ns/op

StackUnwindingBench.long_exception_inline_00   avgt        25     2188.245       19.086  ns/op
StackUnwindingBench.long_exception_inline_01   avgt        25     1707.567       12.551  ns/op
StackUnwindingBench.long_exception_inline_02   avgt        25     1229.186        5.186  ns/op
StackUnwindingBench.long_exception_inline_03   avgt        25      975.936        5.483  ns/op
StackUnwindingBench.long_exception_inline_04   avgt        25      735.410        4.403  ns/op
StackUnwindingBench.long_exception_inline_05   avgt        25      746.726        6.015  ns/op
StackUnwindingBench.long_exception_inline_06   avgt        25      497.924        7.339  ns/op
StackUnwindingBench.long_exception_inline_07   avgt        25      491.823        4.129  ns/op
StackUnwindingBench.long_exception_inline_08   avgt        25      505.133        7.140  ns/op
StackUnwindingBench.long_exception_inline_09   avgt        25      230.584        2.700  ns/op
StackUnwindingBench.long_exception_inline_10   avgt        25      231.575        2.396  ns/op
StackUnwindingBench.long_exception_inline_11   avgt        25      233.448        1.977  ns/op
StackUnwindingBench.long_exception_inline_12   avgt        25      232.751        2.389  ns/op
StackUnwindingBench.long_exception_inline_13   avgt        25      232.258        2.134  ns/op
StackUnwindingBench.long_exception_inline_14   avgt        25      232.807        4.310  ns/op
StackUnwindingBench.long_exception_inline_15   avgt        25      230.926        1.247  ns/op
StackUnwindingBench.long_exception_inline_16   avgt        25      232.417        3.181  ns/op
StackUnwindingBench.long_exception_inline_17   avgt        25      231.465        3.616  ns/op
StackUnwindingBench.long_exception_inline_18   avgt        25      230.850        3.158  ns/op
StackUnwindingBench.long_exception_inline_19   avgt        25      232.098        2.809  ns/op
StackUnwindingBench.long_exception_inline_20   avgt        25        4.297        0.154  ns/op
StackUnwindingBench.long_exception_inline_30   avgt        25        4.307        0.152  ns/op

Notice several things:

With the deepest inlining at level 20 and beyond, the stack unwinding is essentially stopped, since we inline the entire method chain down to the throwing leaf method.
Once we have at least one exception handler transfer, we lose around 230 ns. Recall our previous tests where we called a single throwing methods — these are the same 230 ns.
As we lower the inlining threshold, we effectively break the call chain into smaller inlined chunks. You can use -XX:+PrintInlining to review. Here’s a rough description on what is happening. At level 19, we have inlined everything from callLongCat() through l18(), with a lone leaf of l19(), thus having a single lookup, and 230 ns. Nothing happened until we dropped up to level 8, in which case, we inlined everything from callLongCat() through l08(), and then inlined everything from l09() through to l18(), and then kept a lone l19(), thus having two lookups, and 500 ns. This goes on and on as we lower the inlining threshold.
Default *_inline test "enjoys" the default HotSpot’s setting of -XX:MaxInlineLevel=9.

Lil' Exception Meets The Obnoxious Flags

Rationale

Armed with the experiments from the Lil' Exception’s infancy, we can now conduct the ultimate adolescence test for Exceptions. This is the benchmark code. The highlights for the tests are as follows:

We create a deep chain of 20 method calls, with the last method sometimes producing the exceptional condition.
We deliver the exceptional return with the given ppm (parts-per-million) frequency. Since we need to juggle ppm in some non-trivial manner, we opt to use JMH API, and run all the benchmarks with a optional setting, consume their results, and print them in CSV.
We deliver the exceptional return via several means:
1. dynamic: throws the dynamic exception from the leaf, and catches it in the root
2. static: throws the static exception from the leaf, and catches it in the root
3. dynamicChain: chains the dynamic exceptions at every level. It follows the usual practice of wrapping the exceptions to match API, provide additional metadata, etc.
4. dynamicRethrow: rethrows the caught dynamic exception at every level. There is no syntactic reason to do this, but it potentially exhibits different handler lookup patterns, since the exception is almost always caught at the caller.
5. staticRethrow: ditto, but for static exceptions.
6. dynamicStackless: overrides fillInStackTrace(), producing the stackless exception. This is a questionable practice, but we will quantify it anyway.
7. flags: does not use exceptions at all, uses the int flags to mark the erroneous condition.
All methods in the chain do the work on inputs, in order to prohibit quick fall-through of leaf result straight to the root

The flamewars around exceptions vs. flags often go around performance. Let’s see how exceptions totally devastate the flags performance!

Results

Deepest Inline

Since we know the inlining has the bearing on exceptions performance, let’s quantify the best case scenario first. That is, plug in -XX:MaxInlineLevel=1500 to the run. Running the test on my i5 (down-clocked to 2.0 Ghz), Linux/x86_64, JDK 7u40 will yield this:

"WHOA", should have said the part of developers, and head out to cache exceptions. While they are busy with screwing up their code which I will then fix for money, we’ll zoom in at Y axis:

So, yeah, static exceptions are arguably good, and dynamicStackless is almost catching up. See how other dynamic exceptions are only slower after some frequency. We will discuss what does it mean a bit later, but before that we go for another estimate.

Shallowest Inline

The worst case behavior is arguably when the inlining is turned off completely with -XX:-Inline. Running the experiment like that will yield this chart:

Where’s your static exceptions now? Zooming in:

Well, that’s fairly erratic behavior, which is explained by more non-determinism as we compile all methods individually. But the important thing to notice is that exceptions are either indistinguishable in the noise, or outright lose.

The Middle Ground

Of course, the two cases before are the extreme examples. For something more real-world, let’s not hijack inlining in any sense, but rather leave the VM alone. Re-running the test with default VM options yields:

Now hold on a minute, this is not really different from either case. Of course, the devil in the details. If we zoom in again:

Now both dynamic and static exceptions are faster than flags on some low exceptional frequency.

The Explanation

What do we actually see here?

flags indeed provides more or less consistent performance, except for the last part when they are starting to mess with execution profiles and branch predictors
exception actually works better than flags for up to roughly 10^-3 frequency in the default mode. The reason is that the non-exceptional path enjoys no additional overheads of checking the exceptions, since the exception handling code is laid out and handled elsewhere. Of course as exceptions are getting more frequent, we pay for both stack traces, and VM handling of the exceptions.
static is even faster, and works better than flags for up to roughly 10^-2 frequency in default mode, since it does not pay the cost of constructing the stack traces. When deeply inlined, we do not even pay the cost of stack unwinding while looking for the appropriate exception handler!
dynamicStackless is almost on par with static, because we do not construct the stack trace. However, it is still an object allocation, and we pay the additional cost for that.

That means, when exceptions are exceptional, they bring the best performance of all! If FP folks can preach that Monads are the sane way to construct the software to lead the execution "through the happy path", then I can argue Exceptions are the great feature that leads the generated code "through the happy fast path".

Conclusion

What did we learn from this experience?

Truly exceptional exceptions are beautifully performant. If you use them as designed, and only communicate the truly exceptional cases among overwhelmingly large number of non-exceptional cases handled by regular code, then using exceptions is the performance win.
The performance costs of exceptions have two major components: stack trace construction when Exception is instantiated, and stack unwinding during Exception throw.
Stack trace construction costs are proportional to stack depth at the moment of exception instantiation. That is already bad, because who on Earth knows the stack depth at which this throwing method would be called? Even if you turn off the stack trace generation and/or cache the exceptions, you can only get rid of this part of the performance cost.
Stack unwinding costs depend on how lucky we are with bringing the exception handler closer in the compiled code. Carefully structuring the code to avoid deep exception handlers lookup is probably helping us get more lucky.
Should we eliminate both effects, the performance cost of exceptions is that of local branch. No matter how beautiful it sounds, that does not mean you should use Exceptions as the usual control flow, because in that case you are at the mercy of optimizing compiler! You should only use them in truly exceptional cases, where the exception frequency amortizes the possible unlucky cost of raising the actual exception.
The optimistic rule-of-thumb seems to be 10^-4 frequency for exceptions is exceptional enough. That of course depends on the heavy-weightness of the exceptions themselves, the exact actions taken in exception handlers, etc.

Or, to put it differently:

Thank you, Captain Hindsight!