A developer’s sigh of relief after a successful deployment often hides fragility. This article explains why “it works” is a weak guarantee, illustrates common patterns of incidental correctness, and shows how systematic testing and refactoring can turn shaky code into dependable services.

When “It Works” Is Not Enough

The Problem: “It works” masks hidden brittleness

A teammate announces “it works” and the team breathes easy. The bug is gone, the feature is live, the demo passes. That moment feels like a win, but the statement hides a critical gap: “works” describes a single observation, not a contract.

Reliability, predictability, and testability are properties that survive variation—different loads, data shapes, or deployment environments. Without them, the code is incidentally correct: it produces the right output for the inputs it has seen, but nothing guarantees it will keep doing so when conditions change.

Typical manifestations of incidental correctness

Symptom	Why it looks correct	What breaks under pressure
Two bugs cancel each other	Function A returns a wrong value that a second bug in Function B happens to compensate for	Fix either bug and the whole flow collapses
Edge‑case blind spot	Current test set covers only happy‑path inputs	Unexpected formats, larger payloads, or different locales produce silent errors
Implicit index reliance	Query returns rows in the expected order because a specific index exists	Dropping the index changes ordering, causing downstream failures
Timing assumptions	A retry loop usually succeeds within three attempts	Spike in downstream latency exhausts retries, leading to timeouts
Eventual‑consistency window	Reads typically happen after writes have propagated	High write traffic makes stale reads visible to users

Each of these cases passes a manual “run‑it‑once” check, yet none offers a guarantee that the behavior will hold when traffic grows, data evolves, or the environment shifts.
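
To make the first row of the table concrete, here is a minimal sketch of two bugs that cancel each other. The functions `parse_price_cents`, `format_receipt`, and `format_receipt_fixed` are hypothetical names invented for this illustration; they do not come from any real codebase.

```python
def parse_price_cents(text: str) -> int:
    """Hypothetical parser. Bug 1: it returns whole dollars, not cents."""
    return int(float(text))  # "12.00" -> 12, although the name promises 1200


def format_receipt(amount: int) -> str:
    """Hypothetical formatter. Bug 2: it treats its input as whole dollars."""
    return f"${amount}.00"


def format_receipt_fixed(cents: int) -> str:
    """The formatter after bug 2 is fixed in isolation."""
    return f"${cents // 100}.{cents % 100:02d}"


# The one input anyone tried looks right, because the two bugs cancel.
print(format_receipt(parse_price_cents("12.00")))        # $12.00
# A different data shape exposes the flow: the 50 cents vanish silently.
print(format_receipt(parse_price_cents("12.50")))        # $12.00
# Fixing only one side breaks the input that used to pass.
print(format_receipt_fixed(parse_price_cents("12.00")))  # $0.12
```

The demo input passes a “run‑it‑once” check, yet neither function honors what its name promises, and repairing one without the other makes the visible behavior worse.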

The danger of building on shaky foundations

When a module behaves incidentally, developers downstream treat it as a stable primitive. They compose new functions, add features, and embed the original code deep into the call graph. The hidden assumption becomes a structural strut; any change that alters the accidental behavior can cause cascading failures.

Because the original contract was never explicit, the cost of fixing the problem later explodes:

Discovery often follows an incident – customers notice the failure before the team does.
Refactoring touches many dependents – every caller must be examined, tested, and possibly rewritten.
Knowledge loss – the engineers who understood the quirk may have moved on, leaving a knowledge gap.

The result is a module that everyone avoids touching, not because it is inherently complex, but because its true behavior is opaque.

The Solution: Characterization tests followed by intentional refactoring

Capture current behavior – Write characterization tests that lock in what the code does today, even if the behavior is undocumented. These tests become a safety net for any later changes.
Identify the intended contract – Decide what the function should guarantee (e.g., idempotent, order‑preserving, locale‑independent).
Replace accidental tricks with explicit logic:
- Swap a lucky regex for a well‑named parser.
- Add an explicit ORDER BY clause instead of relying on an index.
- Introduce a proper money type that handles currency precision rather than floating‑point arithmetic.
- Use a deterministic lock or a version token to eliminate race conditions.
Run the test suite – The characterization tests confirm that the refactor did not break existing callers, while new unit/integration tests verify the intended contract.
Iterate – As the codebase stabilizes, replace the characterization tests with more focused specifications of the intended contract.
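
As a rough sketch of the first step, a characterization test might look like the following. The legacy function `normalize_name` is a hypothetical stand-in invented for this example; the point is that the tests record what the code does today, quirks included, rather than what it ought to do.

```python
import unittest


def normalize_name(raw: str) -> str:
    """Hypothetical legacy function whose behavior was never documented."""
    return raw.strip().title()


class NormalizeNameCharacterization(unittest.TestCase):
    """Pins current behavior so a later refactor cannot change it silently."""

    def test_trims_and_capitalizes(self):
        self.assertEqual(normalize_name("  ada lovelace "), "Ada Lovelace")

    def test_quirk_letter_after_apostrophe_is_capitalized(self):
        # Possibly unintended, but callers may rely on it: record it first.
        self.assertEqual(normalize_name("o'neil"), "O'Neil")

    def test_quirk_inner_capitals_are_lowered(self):
        # Almost certainly not the intended contract, yet it is today's behavior.
        self.assertEqual(normalize_name("McDonald"), "Mcdonald")

    def test_whitespace_only_input_becomes_empty(self):
        self.assertEqual(normalize_name("   "), "")
```

Saved as a module, this can be run with `python -m unittest`. Once the intended contract is decided (say, preserving inner capitals), the quirk tests are the ones to rewrite deliberately, which turns an accidental behavior change into a visible decision.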

With this workflow, the code evolves from “it works” to “it works by design.”
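
One of the replacements listed above, the explicit ORDER BY, can be sketched with Python’s built‑in `sqlite3` module. The `events` table, its columns, and both function names are assumptions made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO events (created_at, name) VALUES (?, ?)",
    [("2024-03-02", "deploy"), ("2024-03-01", "build"), ("2024-03-03", "rollback")],
)


def events_unordered(conn):
    # No ORDER BY: SQL leaves the row order unspecified, so whatever order
    # comes back is an accident of storage layout and available indexes.
    return [name for (name,) in conn.execute("SELECT name FROM events")]


def events_by_date(conn):
    # The ordering is now part of the query, with id as a tie-breaker.
    query = "SELECT name FROM events ORDER BY created_at, id"
    return [name for (name,) in conn.execute(query)]


print(events_by_date(conn))  # ['build', 'deploy', 'rollback']
```

Callers of `events_by_date` can depend on chronological order no matter which indexes exist, whereas anything built on `events_unordered` should treat the result as an unordered collection.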

Trade‑offs and when “it works” may be acceptable

Situation	Why a refactor might wait	Risk of deferring
Low‑traffic internal tool	Limited impact, no immediate budget	Low – but documentation of the fragility is still needed
Critical customer‑facing service	Rarely justified: high load, many downstream dependencies	High – invest in tests and refactor now
Prototype for a short‑lived hackathon	Time constraints outweigh long‑term stability	Medium – clearly label the code as experimental

Even when resources are tight, the known fragility should be recorded. A brief note such as “relies on index ordering; add ORDER BY before scaling” makes the hidden assumption visible to future maintainers.

Turning the defensive “it works” into an engineering decision

When a teammate says “it works” as a reason to avoid change, ask for the missing context:

What exact guarantees does the code provide?
Which edge cases have been verified?
Are there tests that would fail if the accidental property changed?
What is the expected cost of a future failure?

If the answer is “no tests, no contract, just a lucky run,” the appropriate response is to allocate time for characterization tests before any refactor.
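
The third question, whether any test would fail if the accidental property changed, can be probed by perturbing that property on purpose. The sketch below assumes a hypothetical pair of functions, one that silently relies on input order and one that states its contract explicitly, and checks both against every ordering of a small input.

```python
from itertools import permutations


def latest_status_fragile(rows):
    """Hypothetical: assumes rows arrive sorted by version, oldest first."""
    return rows[-1]["status"]


def latest_status(rows):
    """Explicit contract: the highest version wins, whatever the input order."""
    return max(rows, key=lambda row: row["version"])["status"]


def holds_for_every_ordering(func, rows, expected):
    # Deliberately vary the property the code might be relying on by accident.
    return all(func(list(order)) == expected for order in permutations(rows))


ROWS = [
    {"version": 1, "status": "pending"},
    {"version": 2, "status": "paid"},
    {"version": 3, "status": "refunded"},
]

print(latest_status_fragile(ROWS))  # refunded (the single lucky observation)
print(holds_for_every_ordering(latest_status, ROWS, "refunded"))          # True
print(holds_for_every_ordering(latest_status_fragile, ROWS, "refunded"))  # False
```

Enumerating permutations is only practical for tiny inputs; for larger ones, sampling orderings at random or using a property‑based testing tool would serve the same purpose.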

Bottom line

“It works” is the floor of software quality. A system built only on that floor will surprise you when the hidden assumptions are violated. By investing in tests that capture current behavior and then refactoring toward explicit contracts, teams raise the ceiling, delivering software that not only runs today but continues to run under the unknown conditions of tomorrow.

If you’re interested in practical steps for adding characterization tests to an existing codebase, check out the Testing Pyramid guide and the property‑based testing library Hypothesis.