
The Hidden Cost of Classical Tests

If you were trained on classical statistics, you probably carry around a mental Rolodex of tests:

  • Student’s t-test for comparing means.
  • Kolmogorov–Smirnov for comparing distributions.
  • Mann–Whitney U when you “don’t want to assume normality.”
You remember rough usage rules, some intimidating conditions, and a ritualistic incantation: p < 0.05 means you win. What you probably don’t remember—and were never really supposed to touch—is the machinery underneath: derived distributions, asymptotic approximations, nuisance parameters, edge cases where the assumptions quietly break.

Those tests were forged in an era when computation was expensive and algebra was cheap (or at least cheaper than simulating a million worlds). The result is a generation of developers and ML engineers trying to bolt 1920s tools onto 2020s data.

A recent column on Substack by Adrià Garriga-Alonso ("Statistical tests are complicated because their inventors did not have fast computers") distills an increasingly relevant point: much of that complexity isn’t sacred; it’s historical. With modern compute and Monte Carlo methods, building powerful, customized hypothesis tests becomes less about memorizing named procedures and more about writing code that directly encodes your assumptions. And if you care about A/B tests, ML benchmarks, or algorithm comparisons, that shift should feel less like heresy and more like a release.

Source: "Statistical tests are complicated because their inventors did not have fast computers" by Adrià Garriga-Alonso, The Column Space, Nov 2, 2025. Original: https://agarriga.substack.com/p/statistical-tests-are-complicated

From Analytical Heroics to Executable Hypotheses

Classical tests solve a real problem: given a null hypothesis ("nothing interesting is happening"), how often would we see a result at least as extreme as the one we observed? Historically, answering that question demanded clever math:

  • You derive the sampling distribution of some test statistic under tight assumptions.
  • You tabulate or approximate that distribution.
  • You plug in your data to get an exact or asymptotic p-value.
The intellectual heroics are impressive. Student’s t-distribution, for example, is not something you casually re-derive between meetings. But the core idea of a hypothesis test is conceptually simple:

  1. Make the null hypothesis precise.
  2. Define a statistic that separates "null looks fine" from "null looks broken."
  3. Ask: under the null, how often would this statistic be at least as extreme as what we saw?
If you have a fast computer, there’s a direct way to do (3): simulate the world where the null is true. Instead of bending your problem to match a named test, you make the null executable and let Monte Carlo do the heavy lifting.

A Computational t-Test in Plain Code

Garriga-Alonso illustrates this with a toy but telling example: comparing sheep heights in Wales vs. New Zealand. Classical route:

  • Null hypothesis: both populations have the same mean height.
  • Assumptions: normality, equal variances, independence.
  • Use the t-test, rely on the t-distribution, obtain an analytic p-value.
Monte Carlo route:

  • Null hypothesis: same mean; assume normal heights with some chosen standard deviation.
  • Statistic: difference in sample means.
  • Procedure:

    • Sample from the null many times.
    • For each run, compute the difference in means.
    • p-value ≈ fraction of simulations where the difference is at least as extreme as observed.

In code, this is a few lines of NumPy instead of pages of derivation. The resulting test shadows the behavior of a t-test without invoking its distributions by name.
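
Here is a minimal sketch of that Monte Carlo route, assuming normal heights with a chosen common standard deviation. The samples and `sigma` below are invented for illustration; they are not taken from the post.

```python
import numpy as np

def monte_carlo_p_value(sample_a, sample_b, sigma=8.0, n_sims=100_000, seed=0):
    """Two-sided Monte Carlo test of 'equal means', assuming normal data with a
    chosen common standard deviation `sigma` (that choice is part of the null)."""
    rng = np.random.default_rng(seed)
    n_a, n_b = len(sample_a), len(sample_b)
    observed = abs(np.mean(sample_a) - np.mean(sample_b))

    # Simulate the null world. The shared mean cancels out of the difference in
    # sample means, so simulating both groups around 0 is enough.
    sims_a = rng.normal(0.0, sigma, size=(n_sims, n_a)).mean(axis=1)
    sims_b = rng.normal(0.0, sigma, size=(n_sims, n_b)).mean(axis=1)
    diffs = np.abs(sims_a - sims_b)

    # p-value: fraction of simulated null worlds at least as extreme as observed.
    return np.mean(diffs >= observed)

# Hypothetical height samples (cm), standing in for Welsh vs. New Zealand sheep.
wales = np.array([91.0, 95.5, 88.2, 97.1, 93.4, 90.8])
new_zealand = np.array([99.3, 96.8, 102.1, 98.7, 101.0, 97.5])
print(monte_carlo_p_value(wales, new_zealand))
```

Fixing `sigma` is what keeps the sketch this short; with the spread known rather than estimated, it behaves more like a z-test, and estimating the spread from the data would push it back toward the classic t-test setup.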

That’s the pattern that matters for working engineers:

  • Your "statistical test" is now a small program.
  • Your model assumptions are explicit, inspectable, and versionable.
  • You’re no longer hostage to whether your situation matches a clean, century-old closed form.

When This Approach Actually Matters for Engineers

It’s tempting to file this under "cute for teaching" and move on. That would be a mistake.

Monte Carlo-defined tests are particularly powerful in the exact places modern teams struggle with classical stats:

1. ML Benchmarks and Leaderboards

If you’re comparing two models on:

  • Non-i.i.d. samples
  • Heavy-tailed losses
  • Per-user or per-query aggregates

…then forcing a t-test on mean accuracy or NDCG is often mathematically unjustified.

Instead:

  • Define a null world where both models are equally good (e.g., same per-query performance distribution).
  • Define a test statistic: improvement in metric, win-rate per query, calibrated cost savings.
  • Simulate under the null or resample from your data (permutation/bootstrap) to get an empirical p-value.

This plugs straight into CI: every time you think you’ve shipped a better model, your pipeline checks not just "is metric_up > 0" but "would a metric_up this large be extremely unlikely under the null of no improvement?"
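
One concrete way to do that resampling is a paired sign-flip permutation test over per-query differences. The scores below are invented; the metric could be accuracy, NDCG, or any other per-query number.

```python
import numpy as np

def paired_permutation_p_value(scores_a, scores_b, n_resamples=20_000, seed=0):
    """Permutation test of 'model A is no better than model B'. Under the null,
    the A/B labels are exchangeable within each query, which is equivalent to
    randomly flipping the sign of each per-query difference."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)   # per-query improvements
    observed = diffs.mean()

    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    null_means = (signs * diffs).mean(axis=1)

    # One-sided p-value: how often does 'no real difference' look at least this good?
    return np.mean(null_means >= observed)

# Hypothetical per-query metric for a candidate model vs. the current baseline.
candidate = np.array([0.71, 0.64, 0.80, 0.58, 0.77, 0.69, 0.73, 0.62])
baseline  = np.array([0.69, 0.66, 0.74, 0.55, 0.78, 0.65, 0.70, 0.61])
print(paired_permutation_p_value(candidate, baseline))
```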

2. A/B Testing with Real-World Messiness

Production traffic is rarely friendly:

  • Seasonality, promotion events, correlated users, delayed conversions.
  • Guardrail metrics with ugly distributions.

Off-the-shelf formulas quietly stop applying.

Monte Carlo testing lets you:

  • Encode your null as a simulation that preserves autocorrelations, user-level structure, or delayed outcomes.
  • Test metrics that don’t have nice textbook distributions (e.g., heavy-tailed revenue, long funnels).

You don’t need a pre-existing named test for "probability that changing ranking feature X is truly beneficial given this gnarly metric." You write the null, the statistic, and let simulation ground the p-value.
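
As one sketch (not the post's example), here is a user-level bootstrap test for a heavy-tailed revenue metric, where the null world is built by resampling users from the two arms pooled together. All numbers are made up.

```python
import numpy as np

def pooled_bootstrap_p_value(revenue_control, revenue_treatment,
                             n_resamples=20_000, seed=0):
    """Null: the treatment has no effect, i.e. per-user revenue in both arms
    comes from the same (heavy-tailed, unknown) distribution, which we
    approximate by resampling users from the pooled data."""
    rng = np.random.default_rng(seed)
    control = np.asarray(revenue_control, dtype=float)
    treatment = np.asarray(revenue_treatment, dtype=float)
    pooled = np.concatenate([control, treatment])
    observed = treatment.mean() - control.mean()

    n_c, n_t = len(control), len(treatment)
    fake_c = rng.choice(pooled, size=(n_resamples, n_c), replace=True).mean(axis=1)
    fake_t = rng.choice(pooled, size=(n_resamples, n_t), replace=True).mean(axis=1)
    null_lifts = fake_t - fake_c

    # One-sided p-value: how often does 'no effect' produce a lift at least this large?
    return np.mean(null_lifts >= observed)

# Hypothetical per-user revenue: mostly zeros with a heavy tail of big spenders.
control = np.array([0, 0, 0, 4.99, 0, 0, 129.0, 0, 9.99, 0, 0, 0])
treatment = np.array([0, 19.99, 0, 0, 0, 249.0, 0, 4.99, 0, 0, 14.99, 0])
print(pooled_bootstrap_p_value(control, treatment))
```

The same skeleton extends to the messier cases above: if conversions are delayed or users are correlated, the resampling unit and the simulator change, but the "simulate the null, compare the statistic" shape does not.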

3. Systems and Performance Engineering

When evaluating:

  • Latency differences between two deployments
  • Error rates across regions
  • Tail behavior under different load-shedding strategies

…the natural statistics (99.9th-percentile latency, SLO miss counts, joint tail events) rarely have tractable closed forms.

But it’s easy to:

  • Simulate under the null (e.g., both systems are equally performant given historical noise characteristics).
  • Ask: how often would we see this big a divergence in p99 latency just by chance?

It’s exactly the logic of classical hypothesis testing—but aligned with your infrastructure reality instead of an idealized textbook.
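
A sketch of that question for p99 latency, under the simplifying assumption that individual requests are exchangeable between the two deployments (real traffic with autocorrelation would need a null simulation that preserves it). The latency data below is synthetic.

```python
import numpy as np

def p99_divergence_p_value(latencies_a, latencies_b, n_resamples=10_000, seed=0):
    """Null: both deployments are equally performant, so requests are
    exchangeable between them. Statistic: absolute difference in p99 latency."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(latencies_a), np.asarray(latencies_b)
    observed = abs(np.percentile(a, 99) - np.percentile(b, 99))

    pooled = np.concatenate([a, b])
    n_a = len(a)
    count = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)                          # reassign requests at random
        fake_a, fake_b = pooled[:n_a], pooled[n_a:]
        if abs(np.percentile(fake_a, 99) - np.percentile(fake_b, 99)) >= observed:
            count += 1
    return count / n_resamples

# Synthetic request latencies in milliseconds for two deployments.
data_rng = np.random.default_rng(1)
deploy_a = data_rng.lognormal(mean=3.0, sigma=0.4, size=5_000)
deploy_b = data_rng.lognormal(mean=3.0, sigma=0.5, size=5_000)  # heavier tail
print(p99_divergence_p_value(deploy_a, deploy_b))
```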

The Subtlety: Composite Nulls and Real-World Constraints

Garriga-Alonso is careful to flag a real limitation: the elegance of "just simulate the null" fades when the null is not a single distribution, but a whole family (a composite hypothesis).

For example:

  • "The means are equal for some unknown shared variance."
  • "This model is calibrated for some unknown but fixed click-through rate distribution."

To be rigorous, you must consider the "least convenient" null in that family—the one that makes your evidence look least surprising. That can require:

  • Searching a grid of parameters (sketched in code just after this list).
  • Using likelihood ratios or more advanced techniques.
  • Accepting that some classical tests embed this optimization analytically in ways that are still hard to beat computationally.
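
To make the grid-search option concrete, here is an illustrative (not the post's) worst-case version of the earlier sheep test: report the largest p-value over a grid of candidate shared standard deviations, i.e. score the evidence against the least convenient member of the null family.

```python
import numpy as np

def worst_case_p_value(sample_a, sample_b, sigma_grid, n_sims=50_000, seed=0):
    """Composite null: 'equal means, for SOME shared standard deviation'.
    Return the largest p-value across a grid of candidate sigmas."""
    rng = np.random.default_rng(seed)
    n_a, n_b = len(sample_a), len(sample_b)
    observed = abs(np.mean(sample_a) - np.mean(sample_b))

    p_values = []
    for sigma in sigma_grid:
        sims_a = rng.normal(0.0, sigma, size=(n_sims, n_a)).mean(axis=1)
        sims_b = rng.normal(0.0, sigma, size=(n_sims, n_b)).mean(axis=1)
        p_values.append(np.mean(np.abs(sims_a - sims_b) >= observed))
    return max(p_values)

# Same hypothetical samples as before; the sigma grid itself is a modeling choice.
wales = np.array([91.0, 95.5, 88.2, 97.1, 93.4, 90.8])
new_zealand = np.array([99.3, 96.8, 102.1, 98.7, 101.0, 97.5])
print(worst_case_p_value(wales, new_zealand, sigma_grid=np.linspace(2.0, 12.0, 11)))
```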

So no, Monte Carlo doesn’t make theory irrelevant. The bitter lesson here, as in AI, isn’t "stop doing math"; it’s that scale-plus-compute is often more flexible, extensible, and easier to operationalize than brittle analytic shortcuts.

Making the p-Value Honest: Enter Chernoff–Hoeffding

One objection to simulation-based tests is: "But that p-value is approximate. Can we trust it?"

Yes—if we treat it like any other estimator and bound its error.

Garriga-Alonso sketches a clean approach using the Chernoff–Hoeffding bound. The key idea:

  • Each simulation contributes a Bernoulli variable: 1 if the simulated statistic is at least as extreme as observed, 0 otherwise.
  • The empirical p-hat is their mean over N simulations.
  • Chernoff–Hoeffding tells us how unlikely it is for that empirical mean to deviate from the true p by more than ε.

This lets you:

  • Choose N and ε so that "the simulation underestimates the true p-value by more than ε" has probability less than some tiny δ.
  • Adjust your decision threshold to include this slack.

In practice: crank N, pick conservative bounds, and your Monte Carlo p-values can be as legally and scientifically defensible as any analytic one—backed by finite-sample guarantees instead of hand-wavy "large-sample" comfort.
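
For concreteness: the one-sided Hoeffding form says P(p̂ ≤ p - ε) ≤ exp(-2Nε²), so N ≥ ln(1/δ) / (2ε²) simulations keep the chance of underestimating the true p-value by more than ε below δ. A tiny helper (the ε and δ values here are arbitrary choices):

```python
import math

def simulations_needed(epsilon, delta):
    """Smallest N such that, by the one-sided Chernoff-Hoeffding bound
    P(p_hat <= p - epsilon) <= exp(-2 * N * epsilon**2), the Monte Carlo
    p-value underestimates the true p-value by more than epsilon with
    probability at most delta."""
    return math.ceil(math.log(1.0 / delta) / (2.0 * epsilon ** 2))

# Example with arbitrary numbers: tolerate epsilon = 0.005 slack with delta = 1e-6,
# then decide at the adjusted threshold p_hat < 0.05 - epsilon = 0.045.
print(simulations_needed(epsilon=0.005, delta=1e-6))   # ~276,000 simulations
```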

Why This Should Change How You Build

For developers, data scientists, and infra leads, the message is not "throw away t-tests". It’s more ambitious:

  • Stop contorting your questions to match ancient tests.
  • Start designing tests that reflect your systems, your metrics, your users—and let compute do what wasn’t affordable in 1908.

Concretely, teams should be doing things like:

  • Embedding simulation-based hypothesis tests into CI for ML models and ranking systems.
  • Using permutation or bootstrap-style tests for metrics without clean parametric forms.
  • Treating "null as code": version-controlled, reviewable, documented alongside production logic.
  • Leveraging tail-bound results (like Chernoff–Hoeffding) to make those tests robust enough for compliance and audits.

This is less about statistical aesthetics and more about engineering hygiene. A Monte Carlo test built from your actual data-generating pipeline is often more truthful than an elegant test built for someone else’s world.

The bitter lesson in AI was that general methods that scale with compute tend to win. In statistics for modern systems, the rhyme is hard to miss.

When the Math Becomes a Tool, Not a Gatekeeper

The most important cultural shift here is psychological.

Classical hypothesis testing has long functioned as a gatekeeping ritual—full of specialized names, bespoke conditions, and opaque justifications. Monte Carlo methods turn a large chunk of that gatekeeping into engineering work that your team already knows how to do: write a model, simulate, measure, bound.

That doesn’t trivialize statistics. It democratizes it.

For a field where every deploy, every model rollout, every "is this improvement real?" decision is, at its core, a statistical question, that democratization is overdue.

And now, thanks to the compute on your desk (or in your cluster), you’re out of excuses.
