
Aleksey Shipilёv, @shipilev, aleksey@shipilev.net

Note
This post is also available in ePUB and mobi.

Preface

Benchmarking business is hard, no surprises there. Most of the time, benchmarking involves measuring time, and it is remarkable how many people believe time is very simple to measure. Better yet, most people believe that asking the system for time is dirt cheap, and can be used to measure just about any effect one could want to measure. In this post, we will explore some of the remarkable pitfalls in measuring time on modern computers, hopefully breaking some wishful-thinking constructs along the way.

This post is the translation of a benchmarking talk I gave in Russian in April 2014. I think it serves as a good warning for a broader audience, so, at the expense of spoiling my conference circuit this year (which I don't want to do anyway), we will push this out early.

As a good tradition, we will take some diversions into benchmarking methodology, so that even though the post itself is targeted to platform people, the non-platform people can still learn a few tricks. As usual, if you still haven’t learned about JMH and/or haven’t looked through the JMH samples, then I suggest you do that first before reading the rest of this post for the best experience.

Building Performance Models

These days I find myself telling people that benchmark numbers don't matter on their own. What matters is the models you derive from those numbers. Refined performance models are by far the noblest and greatest achievement one can get from benchmarking: they contribute to understanding how computers, runtimes, libraries, and user code work together.

For the sake of demonstration, we will start with a sample problem which does not yet involve measuring time directly. Let us ask ourselves: "What is the cost of a volatile write?" It seems a simple question to answer with benchmarking, right? Shove in some threads, do some measurements, done!

OK then, let’s run this benchmark:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 5, time = 200, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 200, timeUnit = TimeUnit.MILLISECONDS)
@Fork(50)
public class VolatileWriteSucks {

    private int plainV;
    private volatile int volatileV;

    @GenerateMicroBenchmark
    public int baseline() {
        return 42;
    }

    @GenerateMicroBenchmark
    public int incrPlain() {
        return plainV++;
    }

    @GenerateMicroBenchmark
    public int incrVolatile() {
        return volatileV++;
    }

}

Let’s measure this on some handy platform, say my laptop (2x2 i5-2520M, 2.0 GHz, Linux x86_64, JDK 8 GA) with a single worker thread:

Benchmark                             Mode   Samples         Mean   Mean error    Units
o.s.VolatileWriteSucks.baseline       avgt       250        2.042        0.017    ns/op
o.s.VolatileWriteSucks.incrPlain      avgt       250        3.589        0.025    ns/op
o.s.VolatileWriteSucks.incrVolatile   avgt       250       15.219        0.114    ns/op

Okay. Volatile writes are over 4x slower! That means if we use volatile writes in my application, it becomes 4x slower! We should avoid volatiles at all costs, and get an immediate performance boost! Yeah, well…​ I don't know how to break this to people, but this experiment has a fatal flaw.

That flaw is not in the benchmark methodology; well, almost. The benchmark truly measures what it was intended to measure: how much time we spend incrementing the volatile variable under these particular conditions. But is that what we really want to know: how the system performs when we bash it with back-to-back heavy-weight operations? Surely not; our production code is not that stupid. In real code, heavy-weight operations are mixed with relatively light-weight ops, which amortize the costs. Therefore, to gain useful data from the experiment, we need to simulate that mix.

Emulating a real workload is a painful exercise on its own. Luckily, we have faced this issue so frequently that JMH has an emulator of its own. Meet BlackHole.consumeCPU(int tokens). It "consumes" CPU time linearly in the number of tokens, and hopefully does so without contention or interference with other computations. It does not sleep; it really burns off CPU time. This enables a more complicated experiment, which will guide us towards a clearer performance model:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 5, time = 200, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 200, timeUnit = TimeUnit.MILLISECONDS)
@Fork(50)
public class VolatileBackoff {

    @Param({"0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
            "10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
            "20", "21", "22", "23", "24", "25", "26", "27", "28", "29",
            "30", "31", "32", "33", "34", "35", "36", "37", "38", "39","40"})
    private int tokens;

    private int plainV;
    private volatile int volatileV;

    @GenerateMicroBenchmark
    public void baseline_Plain() {
        BlackHole.consumeCPU(tokens);
    }

    @GenerateMicroBenchmark
    public int baseline_Return42() {
        BlackHole.consumeCPU(tokens);
        return 42;
    }

    @GenerateMicroBenchmark
    public int baseline_ReturnV() {
        BlackHole.consumeCPU(tokens);
        return plainV;
    }

    @GenerateMicroBenchmark
    public int incrPlain() {
        BlackHole.consumeCPU(tokens);
        return plainV++;
    }

    @GenerateMicroBenchmark
    public int incrVolatile() {
        BlackHole.consumeCPU(tokens);
        return volatileV++;
    }
}

There, we "back off" a little bit before doing the operation under test. @Param allows us to juggle the backoff, and so it helps to estimate how well the amortization is working. You may also notice there are two more baseline implementations, we put it there deliberately to illustrate another point below.

OK, doing the experiment:

[figure: backoff measurements, raw results]

If you make graphs like these and think they are alright, you are going to Hell. If I get there earlier, I will make sure you are shown cryptic trend lines potentially holding the data on how to break free and get to Heaven. Charts like these, obviously, will not say anything you want them to say. Ever.

We may want to contrast the chart a bit by subtracting baseline_Plain, and ignoring other baselines for a moment:

[figure: backoff, baseline_Plain subtracted]

Looks cool, and seems to prove the amortizing works. From the data, it would seem that after 20 consumed tokens we may stop caring about the cost of volatile ops. Looking back at the experimental data, or the chart above (yes, sometimes you can get a tiny bit of useful data even from a bad chart), it means a volatile op every 50 ns is only marginally slower than a plain op every 50 ns.

Okay now, we also had a few other baselines. Before we look at them, let us ask ourselves: "Was it really a good idea to subtract baseline_Plain?" The answer is, unfortunately, "No, it was not!" Here is why; let us subtract baseline_Return42:

[figure: backoff, baseline_Return42 subtracted]

Double U. Tee. Ef. Increments are now faster than the baseline? This is not surprising to seasoned performance engineers, really, because performance is not composable: there is no way to predict how two modules with known independent performance will perform together.

"Surely you are joking", someone would say, "`baseline_Return42` is only slower because it obviously does more operations than baseline_Plain, namely returning the integer constant". Fair enough, let us look at something that does even more work: baseline_ReturnV, which also reads the integer from memory before returning it:

[figure: backoff, baseline_ReturnV subtracted]

See, it works faster! The point is, baseline measurements are experimental data too, and they are good as a reference for the effect you test. You cannot promote them to golden table values you can unconditionally trust. With caveats like these, we might as well compare the incrPlain vs. incrVolatile difference directly:

[figure: backoff, incrPlain vs. incrVolatile difference]

This chimes with our earlier observation that volatile write costs are dramatically amortized if we are not choking the system with them. This exercise shows a few important points:

  • We need performance models to predict the system behavior across the wide variety of conditions;

  • Building performance models implicitly assumes control experiments, and those controls may expose questionable behaviors of the experimental setup, as we saw above with the baselines;

  • And most importantly, these combinatorial experiments allow us to mix the operations in different manners, and reason about their independent performance with more predictive power.

Wait, what? Surely we can omit that boring "mixing" part, and measure the effects directly? We can just wrap those volatile writes in System.nanoTime() calls and forget all the mess with the baselines and what not! Brace yourself, my hero, as you are entering the Terrain of Crushed Dreams.

Timers

It is super fun and easy to measure the code like this:

// call continuously
public long measure() {
  long startTime = System.nanoTime();
  work();
  return System.nanoTime() - startTime;
}

Where is the catch? Well, System.nanoTime() is no magic fairy dust: it needs time to execute, and it may offset the results. So, before we write code like this, we need to benchmark nanoTime itself! To that end, we are interested in two characteristics: latency and granularity.
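Even before benchmarking anything properly, one can glimpse the problem by timestamping an empty body. The "duration" of nothing is not zero; you mostly measure the timer itself. A quick sketch:

```java
public class EmptyWork {
    public static void main(String[] args) {
        // take back-to-back timestamps with no work in between,
        // and record the smallest nonzero difference we ever see
        long best = Long.MAX_VALUE;
        for (int i = 0; i < 1_000_000; i++) {
            long t0 = System.nanoTime();
            // no work at all
            long dt = System.nanoTime() - t0;
            if (dt > 0 && dt < best) best = dt;
        }
        System.out.println("minimum observed 'duration' of nothing: " + best + " ns");
    }
}
```

On a typical Linux box this prints a few tens of nanoseconds, which foreshadows the latency and granularity numbers below.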

Note
The remaining part would treat nanoTime as a black box. If you want to know how it works internally, you can start from "Inside The HotSpot VM Clocks" by David Holmes, and pick up from there.
Note
We do not bother with currentTimeMillis, because its granularity is beyond any joke. We skip new Date() and other "clever" ways to measure time for the same reason.
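As a quick illustration of why currentTimeMillis is out of the game, one can spin until its value changes and look at the step, the same trick we use for nanoTime granularity later. On most systems the step is on the order of 1-15 ms, orders of magnitude coarser than what we are after:

```java
public class MillisGranularity {
    public static void main(String[] args) {
        long t0 = System.currentTimeMillis();
        long t1;
        do {
            t1 = System.currentTimeMillis(); // spin until the value changes
        } while (t1 == t0);
        // typical step: 1..15 ms, depending on the OS timer interrupt settings
        System.out.println("currentTimeMillis granularity ~ " + (t1 - t0) + " ms");
    }
}
```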

Latency

Latency, in this context, is the amount of time we spend in nanoTime call. It is rather easy to measure with JMH, with the benchmark code below:

@GenerateMicroBenchmark
public long latency_nanotime() {
  return System.nanoTime();
}

There is a subtle problem though: there is no reason for you to believe JMH measures the time correctly, so most paranoid people (like me) go on and look for the generated code for this benchmark:

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void latency_nanotime_AverageTime_measurementLoop(InfraControl control, RawResults result, TimerLatency_1_jmh l_bench, BlackHole_1_jmh l_blackhole) throws Throwable {
  long operations = 0;
  long realTime = 0;
  result.startTime = System.nanoTime();
  do {
    l_blackhole.consume(l_bench.latency_nanotime());
    operations++;
  } while(!control.isDone);
  result.stopTime = System.nanoTime();
  result.realTime = realTime;
  result.operations = operations;
}

Oh good, we only take two major timestamps per iteration, so we are not interfering with the testing infrastructure itself while measuring nanoTime. Great. Moving on.

Granularity

Granularity is the minimum difference we can get between consecutive nanoTime calls. Normally, you would call nanoTime over and over again and record the minimum difference, but we can go another route. Instead, we will spin until nanoTime returns a different value, which estimates the granularity quite nicely:

private long lastValue;

@GenerateMicroBenchmark
public long granularity_nanotime() {
  long cur;
  do {
    cur = System.nanoTime();
  } while (cur == lastValue);
  lastValue = cur;
  return cur;
}

One-off Measurements

These tests are self-contained enough to be usable by the general crowd, and hence I hacked together a small project which measures both latency and granularity under some interesting conditions. Now, if you run this project on some platforms of interest…​ Quite a few people have actually sent their interesting results to me. Stop sending the results already!

This is a typical result from a modern Linux distribution:

Java(TM) SE Runtime Environment, 1.7.0_45-b18
Java HotSpot(TM) 64-Bit Server VM, 24.45-b08
Linux, 3.13.8-1-ARCH, amd64

Running with 1 threads and [-client]:
   granularity_nanotime:      26.300 +- 0.205 ns
       latency_nanotime:      25.542 +- 0.024 ns

Running with 1 threads and [-server]:
   granularity_nanotime:      26.432 +- 0.191 ns
       latency_nanotime:      26.276 +- 0.538 ns

What can we learn from this? The latency of a nanoTime call is around 30 ns, and the granularity follows suit, obviously. In our granularity benchmark, this probably means the very first re-read already observes an increased value, so we effectively did a single nanoTime call per step. Note the greatest implication of all: you are unable to directly measure anything shorter than 30 ns, and the lower bound on the measurement error is half the granularity, therefore around 15 ns. Good luck measuring volatile writes, which take around 15 ns on their own, with that.

This is a typical result from Solaris:

Java(TM) SE Runtime Environment, 1.8.0-b132
Java HotSpot(TM) 64-Bit Server VM, 25.0-b70
SunOS, 5.11, amd64

Running with 1 threads and [-client]:
   granularity_nanotime:      29.322 +- 1.293 ns
       latency_nanotime:      29.910 +- 1.626 ns

Running with 1 threads and [-server]:
   granularity_nanotime:      28.990 +- 0.019 ns
       latency_nanotime:      30.862 +- 6.622 ns

It follows the same pattern. Clock rates differ a bit across machines, and so you spend a different amount of time for the same number of clock ticks. There is anecdotal evidence from some Linux/Solaris folks that low-power CPU states running at 800 MHz can push latencies up to 150 ns per nanoTime call.

Those were *nix-ish platforms. Let’s try Windows. This is a very typical Windows data point:

Java(TM) SE Runtime Environment, 1.7.0_51-b13
Java HotSpot(TM) 64-Bit Server VM, 24.51-b03
Windows 7, 6.1, amd64

Running with 1 threads and [-client]:
   granularity_nanotime:     371,419 +- 1,541 ns
       latency_nanotime:      14,415 +- 0,389 ns

Running with 1 threads and [-server]:
   granularity_nanotime:     371,237 +- 1,239 ns
       latency_nanotime:      14,326 +- 0,308 ns

Color me surprised. The latency is actually great, since the clock rates are very high. But granularity? It seems we only have 370 ns resolution, which means 26 consecutive nanoTime calls can return the same value, and only the 27th will yield a new result. Which means, if you try to measure a very thin effect, you may perceive a zero duration. I think if you average across many measurements, you will still estimate the mean more or less rigorously, thanks to the Central Limit Theorem.
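The "26 consecutive calls" figure follows directly from the measured numbers above: with roughly 14 ns of latency per call and a 371 ns timer step, about 26 back-to-back calls fit into a single tick. A quick sanity check of that arithmetic:

```java
public class WindowsTimerMath {
    public static void main(String[] args) {
        double granularityNs = 371.4; // measured granularity from the run above
        double latencyNs = 14.4;      // measured latency from the run above
        // how many back-to-back calls land inside one timer tick
        long callsPerTick = Math.round(granularityNs / latencyNs);
        System.out.println("nanoTime calls per timer tick ~ " + callsPerTick);
    }
}
```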

Anyhow, let us take a larger Windows server (a very rare beast) and see what happens when all the hardware threads are busy cranking out nanoTime calls. Our test system had 2x8x2 = 32 hardware threads; here you go:

Java(TM) SE Runtime Environment, 1.8.0-b132
Java HotSpot(TM) 64-Bit Server VM, 25.0-b70
Windows Server 2008, 6.0, amd64

Running with 32 threads and [-client]:
   granularity_nanotime:  15137.504 +-   97.132 ns
       latency_nanotime:  15190.080 +- 1760.500 ns

Running with 32 threads and [-server]:
   granularity_nanotime:  15118.159 +-  121.671 ns
       latency_nanotime:  15176.690 +- 1504.406 ns

Bloody hell. 15 microseconds to get a timestamp! Before jumping to conclusions, let us remember our volatile write experiment from before. Aha! It seems a bad idea to drown the system with heavy-weight operations, and we need to quantify the effect when amortizing operations are present.

Backoff Measurements

We will use the same technique as before: variable backoffs. Since we have to juggle both backoff and thread count, we will have to use JMH API to construct the appropriate experiment:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(5)
@State(Scope.Thread)
public class TimerLatency {

    @Param
    int backoff;

    @GenerateMicroBenchmark
    public long nanotime() {
        BlackHole.consumeCPU(backoff);
        return System.nanoTime();
    }

    public static void main(String... args) throws RunnerException {
        PrintWriter pw = new PrintWriter(System.out, true);
        for (int b = 1; b <= 1048576; b *= 2) {
            for (int t = 1; t <= Runtime.getRuntime().availableProcessors(); t++) {
                runWith(pw, t, b);
            }
        }
    }

    private static void runWith(PrintWriter pw, int threads, int backoff) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include(".*" + TimerLatency.class.getName() + ".*")
                .threads(threads)
                .verbosity(VerboseMode.SILENT)
                .param("backoff", String.valueOf(backoff))
                .build();

        RunResult r = new Runner(opts).runSingle();
        double score = r.getPrimaryResult().getScore();
        double scoreError = r.getPrimaryResult().getStatistics().getMeanErrorAt(0.99);
        pw.printf("%6d, %3d, %11.3f, %10.3f%n", backoff, threads, score, scoreError);
    }

}

The backoff time can be thought of as the time we spend in the code under measurement, and that helps to figure out where the overheads of nanoTime take the precedence over the measured code itself.

This is a Linux result from 2x12x2 Xeon (Ivy Bridge), RHEL 5, running JDK 8 GA:

[figure: nanoTime latency vs. threads, Linux]

Our backoffs increase exponentially, which means they should be equally spaced on a log axis. If there were no overhead from nanoTime calls, the line for a given backoff would be parallel to the X axis, indicating that the cost of "nanoTime + backoff" is constant across all thread counts.

Nothing is really bad in this Linux result. You can see the overheads of nanoTime creep in a bit, but these effects are only visible at the lower decades of the logarithmic scale. The lower backoffs are condensed a bit below 100 ns, which means the nanoTime overheads are significant there.

Now let us look at Windows result, running on 2x8x2 Xeon (Sandy Bridge), Windows 2008 Enterprise Edition, JDK 8 GA:

[figure: nanoTime latency vs. threads, Windows]

There. Scalability bottleneck. I like to interpret this chart as follows. Let us take some middle-ground backoff value, say the green line right between 10^3 and 10^4 ns, and start following that line from left to right, as if we are increasing the number of threads. It all goes fine with a single thread, two threads, even four threads. But then, somewhere past five threads, the execution time starts to drag upwards: we hit the nanoTime overhead!

Lower backoff values obviously hit that bottleneck sooner, and higher backoffs hit it later. Looking at this graph, calling nanoTime more often than once per 32 us across all threads hits the scalability bottleneck on this particular machine. The entire white area under that "scalability front" is a forbidden zone: you can't measure anything there.

Note
This is why JMH-ish SampleTime mode will backoff aggressively instead of measuring every single call.

One can use this data to bash Windows hard…​ but really, the trouble is with supporting nanoTime monotonicity: the values users get from consecutive calls should never decrease, as if time went backwards. A remarkable majority of user code does not expect that from the Universe.

When your OS declares some timer monotonic, it has to face the music with hardware that may have unsynchronized time sources across CPUs. Hence, the scalability troubles. Or, even worse, the OS can declare the counter monotonic and use fancy HW instructions, but then fail to achieve monotonicity due to implementation bugs, leaving us to plug the leak at the runtime level. This is the case for Solaris running on a 2x8x2 Xeon (Sandy Bridge) with JDK 8 GA:

[figure: nanoTime latency vs. threads, Solaris]

The same scalability bottleneck, albeit a significantly lighter one, because we want to protect user code from observing retrograde time. Technically, that is a global variable tracking the maximum observed value, which we CAS on each nanoTime call. Oh, the burn…​
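A hedged sketch of that safety net in plain Java (the real HotSpot guard lives in native code; the class and method names here are made up): a global maximum timestamp that every reader CASes forward, so no caller ever observes time running backwards.

```java
import java.util.concurrent.atomic.AtomicLong;

public class MonotonicGuard {
    private static final AtomicLong LAST = new AtomicLong(); // max observed timestamp

    static long nanoTime() {
        long raw = System.nanoTime(); // possibly non-monotonic source
        while (true) {
            long last = LAST.get();
            if (raw <= last) {
                return last; // clamp: never report a value below the max seen so far
            }
            if (LAST.compareAndSet(last, raw)) {
                return raw;  // we advanced the global maximum
            }
            // lost the race to another thread; retry against the new maximum
        }
    }

    public static void main(String[] args) {
        long prev = nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            long cur = nanoTime();
            if (cur < prev) throw new AssertionError("time went backwards");
            prev = cur;
        }
        System.out.println("1M reads, no retrograde time observed");
    }
}
```

This single contended atomic per call is exactly the kind of thing that turns a timer into a scalability bottleneck.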

These observations give the context for the improvement in another development platform. Behold, this is Mac OS X running JDK 8 GA today:

Java(TM) SE Runtime Environment, 1.8.0-b132
Java HotSpot(TM) 64-Bit Server VM, 25.0-b70
Mac OS X, 10.9.2, x86_64

Running with 1 threads and [-server]:
   granularity_nanotime: 1009.623 +-  2.140 ns
       latency_nanotime:   44.145 +-  1.449 ns

Running with 4 threads and [-server]:
   granularity_nanotime: 1044.703 +- 32.103 ns
       latency_nanotime:   56.111 +-  3.397 ns

The granularity is coarse because we use the microsecond-precision OS call. We could choose another OS call with better precision, but nowhere in the spec do we find any mention of that call being monotonic. So, even though it is monotonic in all our experiments, we need the same safety net we have enabled on Solaris by default; lucky for us, there are not that many heavily-parallel Macs out there. A shoot-yourself-in-the-foot VM option, -XX:+AssumeMonotonicOSTimers, which disables that safety net, is under discussion. You can follow JDK-8040140 and the relevant discussion on hotspot-runtime-dev to learn more about this.

The lessons from this part are:

  • nanoTime is not dirt cheap; at best, you can hope for 15-30 ns per call

  • nanoTime is not nanosecond resolution; at best, you can hope for 30 ns resolution, and it varies wildly between platforms

  • nanoTime is a scalability bottleneck; up to the point that everything you measure would be the contended performance of nanoTime itself.

Good luck with monitoring things in production with nanoTime. This is why you need a controlled benchmarking if you are serious about performance: in many cases, you can not afford online timestamping on a live deployment. Unless, of course, you are being smart about that timestamping.

People would say: "So what. The parts we are measuring are coarse-grained enough to avoid these latency and scalability pitfalls". There is a pitchfork for them as well.

Omission Considerations

Gil Tene likes to talk about Coordinated Omission in the context of measuring latency; we will direct interested readers there. In this section, we would like to talk about omission in the context of measuring throughput. It seems natural to want to separate the timings of one particularly interesting part of your application from the setup code. In other words, we want to measure the throughput of work itself, without the surrounding setup:

public double measure() {
   long ops = 0;
   long startTime = System.nanoTime();
   while (!isDone) {
      setup(); // want to skip this
      work();
      ops++;
   }
   // floating-point division: integer division of ops by nanoseconds truncates to zero
   return 1.0 * ops / (System.nanoTime() - startTime);
}

Naively, this is what you do:

public double measure() {
   long ops = 0;
   long realTime = 0;
   while (!isDone) {
      setup(); // skip this
      long time = System.nanoTime();
      work();
      realTime += (System.nanoTime() - time);
      ops++;
   }
   return 1.0 * ops / realTime;
}
Note
This is what JMH does with @Setup(Level.Invocation), and the section below discusses why it has such a scary Javadoc.

Anything bad about this code? It seems completely legit to timestamp only the work, and then use that aggregated time to compute the throughput. In our inexperienced minds, the time measured this way will always be the relevant fraction of the entire execution time. "Relevant"? Mwahaha. Reality begs to differ. If we take both variants of the code above and run them with an increasing number of threads, we will see this for throughput:

[figure: throughput vs. threads, external vs. individual timestamping]

The vertical black line marks the number of available CPUs. External timestamping looks good: the throughput rises until it hits the hardware limit, then stays the same. But individual timestamping is really, really weird: it grows the throughput past the number of actual executors available. What happens there?

Time-stamping a single part of the workload misses what happens outside that part. Granted, that is the very reason we timestamp in the first place. However, we can miss much more than we intended. Let us illustrate this point. Assume work takes 10 ms and setup takes 5 ms, and we measure this test on a machine with 2 hardware threads:

[figure: omission, 2 threads on 2 hardware threads]

Let’s compute the throughput. Obviously, the per-thread throughput of work is 100 ops/sec. We have two symmetric threads, and therefore the total throughput is 200 ops/sec. Now, let us over-saturate the system with two additional threads:

[figure: omission, 4 threads on 2 hardware threads]

The same reasoning for per-thread throughput applies, but now we have four threads, so the perceived total throughput is 400 ops/sec! If you shove in more threads, you proportionally increase the perceived total throughput, even though your system is unable to handle it.
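The arithmetic behind this perceived inflation is easy to check. The "actual" figure below is my own back-of-the-envelope addition, assuming each op fully occupies a CPU for work + setup:

```java
public class OmissionMath {
    public static void main(String[] args) {
        double workMs = 10, setupMs = 5;  // per-op costs from the example above
        int cpus = 2, threads = 4;

        // per-thread throughput as the naive timestamping sees it: setup is excluded
        double perceivedPerThread = 1000.0 / workMs;          // 100 ops/sec
        double perceivedTotal = perceivedPerThread * threads; // scales with threads!

        // what the machine can actually sustain: each op occupies a CPU for 15 ms
        double actualTotal = cpus * 1000.0 / (workMs + setupMs);

        System.out.println("perceived: " + perceivedTotal + " ops/sec");
        System.out.println("actual:    " + actualTotal + " ops/sec");
    }
}
```

Adding threads inflates the perceived number without bound, while the actual capacity of the machine never changes.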

That is, on a host with C CPUs, you can't really run more than C threads simultaneously. I mean, the scheduler maintains the functional illusion of simultaneously running N (N > C) threads, but from a performance perspective, you are only running C at a time.

Any measurement that assumes the threads run uninterrupted faces these issues. One could argue this is because we compute the throughput in a weird way, but really, the same applies to average time: it would seem you can saturate the system with more and more threads, while the average operation time stays the same!

In a queued system, this would be the difference between measuring the end-to-end response time and measuring the processing time spent in one specific stage. When the system is over-saturated and the queue is never empty, queueing latencies add up. The JVM, which has lots of things executing concurrently, may skew this further: on an over-saturated system, GCs start to steal serious time from the application, leading to the same kind of discrepancies.

The lessons from this part are:

  • measuring specific parts of the workload usually underestimates the time;

  • measuring over-saturated system brings in the delays caused by things otherwise lurking in the shadows: schedulers, GCs, etc.

Steady State Considerations

We started this post with a seemingly irrelevant part about volatile writes, and let us conclude with a seemingly irrelevant part about Fibonacci. You will see how far-reaching the implications of the previous sections are.

It amazes me how many people believe Fibonacci is a good primer for introducing yourself to benchmarking. It is, actually, one of the worst examples you can start with. Here is why: do you see any problems with this benchmark?

public class FibonacciGen {
  BigInteger n1 = ONE; BigInteger n2 = ZERO;

  @GenerateMicroBenchmark
  public BigInteger next() {
    BigInteger cur = n1.add(n2);
    n2 = n1; n1 = cur;
    return cur;
  }
}

I spy with my little eye: this benchmark has no steady state. That is, the time spent in each consecutive call of next() keeps increasing, because we are dealing with larger and larger operands in BigInteger.add. Indeed, we can time each call and observe it:

[figure: per-call time of next(), no steady state]

Note that because of the havoc we have with nanoTime, we can’t readily trust the absolute numbers, but we can clearly see the upwards trend. Now, we have three crappy alternatives in measuring the workload like this:

  1. Count how many operations we can do in N seconds (that is, run a time-based benchmark). This is the worst of all: we are effectively computing the integral time across an unknown number of calls, and that gets confusing very, very quickly. In the particular case of FibonacciGen, where the call time grows linearly, the larger the N, the proportionally slower the benchmark appears, since the average time is about half the time of the longest call so far. How one should compare two non-steady-state implementations with a time-based benchmark is beyond me.

  2. Time each operation (that is, run a fine-grained work-based benchmark). It only appears to produce legit data, unless you prove the timer latency and granularity problems are out of the picture, and the omission problems do not bite you. We still capture lots of transients from the first invocations, while the system adapts to the load.

  3. Time the first N operations (That is, do a coarse-grained work-based benchmark). This is arguably a sane middle ground for tiny benchmarks. Timer latency and granularity problems are somewhat alleviated, but transients are still there.

Note
All three options are supported by JMH. First one in the form of Throughput/AverageTime benchmarks. Second one with SingleShotTime. Third one can be arranged with the help of batching.
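One can demonstrate the lack of steady state in FibonacciGen without trusting any timer at all: the operands grow without bound, so BigInteger.add has strictly more work to do on every call. A quick sketch that tracks the operand sizes:

```java
import java.math.BigInteger;

public class FibGrowth {
    public static void main(String[] args) {
        BigInteger n1 = BigInteger.ONE, n2 = BigInteger.ZERO;
        int bitsAt100 = 0, bitsAt1000 = 0;
        for (int i = 1; i <= 1000; i++) {
            // same update as FibonacciGen.next()
            BigInteger cur = n1.add(n2);
            n2 = n1; n1 = cur;
            if (i == 100)  bitsAt100  = cur.bitLength();
            if (i == 1000) bitsAt1000 = cur.bitLength();
        }
        // operand size grows roughly 0.694 bits per step, so add() keeps getting costlier
        System.out.println("bits after 100 calls:  " + bitsAt100);
        System.out.println("bits after 1000 calls: " + bitsAt1000);
    }
}
```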

There is no good solution for non-steady-state benchmarks. Pick your poison, or avoid them completely.

Conclusions

System.nanoTime is as bad as String.intern now: you can use it, but use it wisely. The latency, granularity, and scalability effects introduced by timers may and will affect your measurements if done without proper rigor. This is one of the many reasons why System.nanoTime should be abstracted from the users by benchmarking frameworks, monitoring tools, profilers, and other tools written by people who have time to track if the underlying platform is capable of doing what we want it to do.

In some cases, there is no good solution to the problem at hand. Some things are not directly measurable. Some things are measurable only with impractical overheads. Internalize that fact, weep a little, and move on to building indirect experiments. This is not Wonderland, Alice. Understanding how the Universe works often requires side routes to explore.

In all seriousness, we should be happy our $1000 hardware can measure 30 nanosecond intervals pretty reliably. This is roughly the time needed for the Internet packets originating from my home router to leave my apartment. What else do you want, you spoiled brats?