
Test-Case Reducers: The Debugging Tool Most Programmers Skip

AI & ML Reporter

Laurence Tratt makes the case that automated test-case reducers, long treated as a compiler-author specialty, are general debugging tools almost anyone can use. The surprising part is how a tool that understands nothing about your program can still shrink a failing input by 95 to 99% and let the bug fall out.

Most programmers have a debugging toolkit they reach for by reflex: printf, a debugger, and for the desperate, sanitizers or Valgrind. In a recent post, Laurence Tratt argues that one tool belongs on that list far more often than it appears: the test-case reducer. His pitch is direct. These tools are simple enough that you can write a basic one yourself, they routinely shrink failing inputs by 95 to 99%, and they are not just for compiler writers.

What a reducer actually does

The setup is familiar. You have a program that fails on a large input, and you don't know which part of the input is responsible. Manual reduction works in principle: load the file in an editor, cut a chunk, check whether it still fails. In practice humans are bad at it. We miss obvious cuts, we accidentally turn the crash into a clean run or a different error, and we hit cases where deleting region A does nothing but deleting disjoint regions A and B together does. The search space of possible deletions grows fast.
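The "A and B together" problem is easy to reproduce in miniature. The sketch below is illustrative only: the three-line input and the `balanced_and_buggy` predicate are made up for this example, standing in for a program that needs matched braces to get as far as the bug.

```python
from itertools import combinations

def removable_alone(lines, is_interesting):
    """Indices whose single deletion keeps the input interesting."""
    return [i for i in range(len(lines))
            if is_interesting(lines[:i] + lines[i + 1:])]

def removable_pairs(lines, is_interesting):
    """Index pairs whose joint deletion keeps the input interesting."""
    pairs = []
    for i, j in combinations(range(len(lines)), 2):
        candidate = [line for k, line in enumerate(lines) if k not in (i, j)]
        if is_interesting(candidate):
            pairs.append((i, j))
    return pairs

def balanced_and_buggy(lines):
    # Hypothetical stand-in: braces must match and the buggy call must survive.
    return lines.count("{") == lines.count("}") and "bug()" in lines

lines = ["{", "bug()", "}"]
print(removable_alone(lines, balanced_and_buggy))  # no single line can go
print(removable_pairs(lines, balanced_and_buggy))  # but the two braces can go together
```

An input of n lines has n single deletions but n(n-1)/2 pairs, and the count keeps climbing for triples and beyond, which is why doing this by hand stops being practical so quickly.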

A test-case reducer automates that loop. You give it three things: the program, the input, and an interestingness test. The interestingness test is a small script that returns 0 if the reduced input still triggers the bug you care about, and non-zero otherwise. The reducer keeps trying shorter inputs and uses the interestingness test as its only signal about whether each attempt is worth keeping.
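As a rough illustration of that contract, here is a minimal Python sketch of an interestingness test. The function names, the `needle` parameter, and the 10-second default timeout are assumptions made for this example, not anything taken from Tratt's post or from a particular reducer.

```python
import subprocess
import sys

def is_interesting(command, needle, timeout=10.0):
    """Return True if running `command` prints `needle` on stdout or stderr."""
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # a hang is not the bug we are chasing
    return needle in result.stdout + result.stderr

def exit_code(interesting):
    """Map the verdict onto the convention described above: 0 means keep."""
    return 0 if interesting else 1

# Stand-in for "the program run on a candidate input": a child Python
# process that prints a warning.
demo = [sys.executable, "-c", "print('warning: line too long')"]
print(exit_code(is_interesting(demo, "warning")))
```

A real wrapper script would finish with `sys.exit(exit_code(...))` so the reducer sees the verdict as the process exit status.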

Tratt's key observation, and the one that makes the whole thing click, is that the reducer has no understanding of your test or your program. It does something useful precisely because it does not need to know why it is useful. That ignorance is the feature, not a limitation. The same reducer that shrinks a C compiler crash will shrink a text file or a Python script, because it never modeled any of them in the first place.

Writing one in a few lines

To prove there is no magic, Tratt builds a toy example. A Python program reads words from a file and prints a warning when a line exceeds 25 characters. Pretend that warning is the "bug" you can't explain. The interestingness test is a short shell script that runs the program and greps for the warning string, exiting 0 on a match.

The reducer itself is roughly fifteen lines of Python. It loads the input as a list of lines, then walks through them one at a time, building a candidate with line i removed. It writes the candidate to a temporary file, runs the interestingness test, and if the test still passes, it keeps the shorter version. Run against /usr/share/dict/words, the naive reducer eventually prints a single word: antidisestablishmentarianism, the one entry longer than 25 characters. The implementation is painfully slow (three hours on his desktop for that dictionary), but it works, and that is the point.
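The loop described above can be sketched in a few lines. This is not Tratt's code: to keep the example self-contained, the interestingness test is a Python callable (`has_long_line`, a made-up stand-in for the shell script) rather than an external script fed a temporary file.

```python
def reduce_lines(lines, is_interesting):
    """Single pass: try deleting each line once, keeping deletions that
    leave the input interesting."""
    i = 0
    while i < len(lines):
        candidate = lines[:i] + lines[i + 1:]
        if is_interesting(candidate):
            lines = candidate  # line i was not needed; the next line now sits at i
        else:
            i += 1             # line i is needed (for now); move on
    return lines

def has_long_line(lines):
    # Stand-in for "run the program and grep for the warning".
    return any(len(line) > 25 for line in lines)

words = ["cat", "antidisestablishmentarianism", "dog"]
print(reduce_lines(words, has_long_line))
```

Swapping the callable for one that writes the candidate to a file and runs a script gives the shape the article describes, at the cost of one process launch per attempt.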

A one-line change makes it noticeably better. Each time a reduction succeeds, reset the loop index to zero so the reducer retries lines it previously couldn't remove. That runs about ten times longer but cuts more. The lesson generalizes: nearly every obvious improvement you can imagine is already built into off-the-shelf tools.
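To see why restarting helps, consider a sketch in which one line can only be deleted after another has gone. The `needs_definition` predicate and the three-line input are hypothetical, chosen so that a single forward pass would leave the definition behind.

```python
def reduce_with_restart(lines, is_interesting):
    """Delete one line at a time, restarting from the top after each success."""
    i = 0
    while i < len(lines):
        candidate = lines[:i] + lines[i + 1:]
        if is_interesting(candidate):
            lines = candidate
            i = 0  # the one-line change: earlier lines may be removable now
        else:
            i += 1
    return lines

def needs_definition(lines):
    # Hypothetical rule: the bug must survive, and "use x" is only valid
    # while "define x" is still present.
    if "bug" not in lines:
        return False
    return "use x" not in lines or "define x" in lines

print(reduce_with_restart(["define x", "use x", "bug"], needs_definition))
```

On the first visit "define x" cannot go because "use x" still depends on it. Once "use x" is deleted, the restart revisits "define x" and removes it too, which a single pass would never do.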

Reaching for the real tools

Tratt's preferred reducer is Shrink Ray, written by David MacIver of Hypothesis fame. It ships sensible reduction rules, runs interestingness tests in parallel, and has a UI worth watching. He deliberately ran it with --no-clang-delta to disable its C-specific knowledge, keeping it as ignorant as his hand-rolled version, and it still reduced a random 78-line C program by over 60% by bytes in about fifteen minutes. With language-aware rules turned on it does better still.

Two behaviors stand out. Shrink Ray knows common comment syntaxes and strips them early. More surprisingly, it reduces integer literals to smaller values, which often makes a bug far easier to read. The experience Tratt describes will be familiar to anyone who has used these tools: the reducer grinds along, then finds one cut that suddenly unlocks a cascade of further cuts. In one recent session a segfault input dropped 90% after twenty minutes, then kept shrinking to a 99% reduction, and the bug he had been staring at for ages finally became obvious.
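The integer-literal idea can be pictured with a toy pass like the one below. It is emphatically not Shrink Ray's implementation: the regular expression, the choice to try only 0 as the replacement, and the `divides_by_zero` predicate are all assumptions made for illustration.

```python
import re

INTEGER = re.compile(r"\b\d+\b")  # whole integer literals, not digits inside names

def shrink_integers(text, is_interesting):
    """Try replacing each integer literal with 0, keeping replacements that
    leave the input interesting."""
    pos = 0
    while True:
        match = INTEGER.search(text, pos)
        if match is None:
            return text
        candidate = text[:match.start()] + "0" + text[match.end():]
        if match.group() != "0" and is_interesting(candidate):
            text = candidate
            pos = match.start() + 1  # continue just past the new "0"
        else:
            pos = match.end()        # leave this literal alone

def divides_by_zero(text):
    # Hypothetical stand-in for "the program still crashes".
    return "/ 0" in text

print(shrink_integers("a = 9000\nb = 12 / 0\n", divides_by_zero))
```

With the incidental constants flattened to 0, the one literal that matters is much easier to spot.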

The hard part is the interestingness test

The reducer is easy. The interestingness test is where the real difficulty lives, and most of Tratt's article is practical advice earned the hard way.

The central trap is over-reduction. The reducer is a literal-minded driver of your test, and if the test accepts inputs it shouldn't, the reducer will happily reduce past the point you wanted. Shrink Ray explicitly checks whether your test accepts an empty input, a guardrail Tratt admits he trips often. His C example checks not just that the fast and slow builds produce different output, but that the slow build produces one specific expected hash. Comparing the two outputs directly would let meaningless, misleading differences count as the bug.
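The difference between the loose and strict versions of that check is small in code but large in effect. The sketch below assumes the two builds' outputs are already captured as bytes and uses SHA-256 as the hash; both choices, and the function names, are assumptions for this example.

```python
import hashlib

def digest(output: bytes) -> str:
    return hashlib.sha256(output).hexdigest()

def loose_check(fast_out: bytes, slow_out: bytes) -> bool:
    """Accepts any difference at all, including two unrelated error messages."""
    return fast_out != slow_out

def strict_check(fast_out: bytes, slow_out: bytes, expected_slow_digest: str) -> bool:
    """Accepts only if the slow build still gives the known-good output and
    the fast build disagrees with it."""
    return digest(slow_out) == expected_slow_digest and fast_out != slow_out

expected = digest(b"42\n")  # recorded once from the original, unreduced input

# The real bug: the slow build is right, the fast build is wrong.
print(strict_check(b"41\n", b"42\n", expected))
# An over-reduced input: both builds now fail, just with different messages.
print(loose_check(b"error: x\n", b"error: y\n"))
print(strict_check(b"error: x\n", b"error: y\n", expected))
```

The loose check would let the reducer wander off toward inputs where both builds are broken; pinning the slow build's output keeps it on the original bug.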

Speed matters more than people expect. A reducer can run the interestingness test hundreds of times a second, and a moderate input can require hundreds of thousands of attempts, so a slow test dominates the whole run. Some of his optimizations are unconventional: he once sped a test up roughly 3x by disabling automatic core dump creation. Timeouts deserve thought too. Setting a conservative 60-second timeout on a program that normally finishes in 0.1 seconds slows reduction by orders of magnitude. He measures the program first, then sets a timeout around 1.5 to 2x its initial runtime.

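There is also a subtle failure mode: reducers will cheerfully delete a line like i -= 1 and turn a terminating program into a non-terminating one. If your reducer seems stuck, a candidate that now loops forever is a likely culprit, which is one more reason to give the interestingness test a sensible timeout.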
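Both points, measuring before choosing a timeout and not letting a non-terminating candidate stall the run, fit in a short sketch. The helper names and the default factor of 2 are assumptions for this example; the 1.5 to 2x range is the one the article reports.

```python
import subprocess
import sys
import time

def measure_runtime(command):
    """Time one run of the unreduced case to get a baseline in seconds."""
    start = time.monotonic()
    subprocess.run(command, capture_output=True)
    return time.monotonic() - start

def pick_timeout(baseline_seconds, factor=2.0):
    """Scale the baseline by 1.5 to 2x instead of guessing a 'safe' 60 seconds."""
    return baseline_seconds * factor

def finishes_in_time(command, timeout):
    """False if the candidate is still running at the deadline, for example
    because the reducer deleted the line that made a loop terminate."""
    try:
        subprocess.run(command, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return True

baseline = measure_runtime([sys.executable, "-c", "pass"])
timeout = pick_timeout(baseline)
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
print(finishes_in_time(hanging, timeout))
```

An interestingness test built this way rejects a hanging candidate after roughly twice the normal runtime, so one bad deletion costs a fraction of a second rather than a minute.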

Bending the reducer toward other goals

The most interesting section covers using interestingness tests to optimize for something other than raw input length.

For nondeterministic bugs, Tratt exploits a happy accident: a reducer that strips out a random.random() call turns a flaky failure into a deterministic one. To encourage this, he writes a test that runs the input several times and accepts it if the bug appears at least once. That gets reduction moving but tolerates, and can even increase, nondeterminism. A stricter test that demands the bug appear on every one of n runs would be ideal, except it almost never passes Shrink Ray's initial check when the bug only fires a third of the time. His workaround is pragmatic: start with the loose "at least once" test, watch for the moment reduction has driven the failure rate up, then swap in the strict "every time" test to lock in that progress. He waits to get lucky, then changes the rules so he stays lucky.
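The two tests differ by a single word in Python. In this sketch `run_once` is a hypothetical callable that runs the candidate once and returns True if the bug showed up; `failure_rate` is an extra helper, not something from the article, for judging when to switch from the loose test to the strict one.

```python
def fails_at_least_once(run_once, runs):
    """Loose test: accept if the bug shows up in any of `runs` attempts."""
    return any(run_once() for _ in range(runs))

def fails_every_time(run_once, runs):
    """Strict test: accept only if the bug shows up in all `runs` attempts."""
    return all(run_once() for _ in range(runs))

def failure_rate(run_once, runs):
    """Fraction of attempts on which the bug appeared."""
    return sum(1 for _ in range(runs) if run_once()) / runs

# Scripted outcomes stand in for a flaky program.
flaky = iter([False, True, False])
print(fails_at_least_once(lambda: next(flaky), 3))

flaky = iter([True, False, True])
print(fails_every_time(lambda: next(flaky), 3))
```

Both `any` and `all` stop as soon as the answer is known, so the loose test gets cheaper as the failure rate rises and the strict test rejects bad candidates early.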

The cleanest idea is what he calls the global-counter technique, and he introduces it with an unusually honest disclaimer, calling it the worst code he has ever knowingly published. When debugging his yk JIT, he cares about the length of the generated trace, not the input. So his interestingness test records the smallest trace seen so far in /tmp/global_best and rejects any input whose trace is even one line longer. It is unsound under parallel reduction and makes shaky assumptions about the reducer, but he doesn't care: these are throwaway scripts. The payoff was concrete. A segfault that produced unmanageable 40,000-line traces was steered down to 10,100 lines, and he found the bug within half an hour. The same trick generalizes to driving down wall-clock time, nondeterminism, or whatever proxy you can measure.
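A minimal sketch of the global-counter idea follows, under stated assumptions: the function name and the one-integer file format are invented here, and a temporary directory stands in for the /tmp/global_best path the article mentions. It is meant to be called only after the test has confirmed the bug still reproduces.

```python
import tempfile
from pathlib import Path

def accept_trace(trace_length, best_path):
    """Accept a candidate only if its trace is no longer than the best seen
    so far, recording the new length on acceptance. Like the original, this
    is not safe when interestingness tests run in parallel."""
    path = Path(best_path)
    best = int(path.read_text()) if path.exists() else None
    if best is not None and trace_length > best:
        return False                    # even one line longer is rejected
    path.write_text(str(trace_length))  # equal or shorter becomes the new best
    return True

with tempfile.TemporaryDirectory() as tmp:
    best_file = Path(tmp) / "global_best"
    print(accept_trace(40000, best_file))  # first candidate sets the baseline
    print(accept_trace(40001, best_file))  # one line longer
    print(accept_trace(10100, best_file))  # shorter trace becomes the new best
```

Replacing the trace length with wall-clock time or a measured failure rate gives the other variants the article mentions.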

Why this is worth knowing

Underneath the tooling, a reducer is a hill-climbing search that uses input length as its fitness function. That framing explains both its power and its limits: it gets stuck in local optima, and shorter is not always closer to the bug. The technique Tratt is really teaching is that you can encode additional objectives into the interestingness test and quietly redirect the search, treating a length-minimizing tool as a general-purpose optimizer.

The broader message is a corrective to a stereotype. Test-case reducers got their reputation among compiler authors, a group many developers regard as an unreachable elite, and that association has kept the tools out of ordinary toolkits. Tratt's point is that the entry cost is a fifteen-line script and a shell wrapper, and the technique applies far beyond compilers. The history reflects how niche this stayed: he traces the lineage from RAGS through ddmin to creduce, and notes he had not heard of the first two until researching the post. For practical use today, Shrink Ray and creduce are the tools to start with, and the real skill is learning to write interestingness tests that say exactly what you mean.
