This Is Not Prompt Engineering: The Case for Testing Prompts Like Code

A growing group of engineers argues that the hard part of building with large language models was never the wording of prompts. It is the testing. Treating prompts as untested production code is where most AI projects quietly break.

The phrase "prompt engineering" promised a craft. Find the right incantation, the thinking went, and a model would behave. Two years of teams shipping LLM features into production have mostly retired that idea. The wording matters, but it is not the bottleneck. The bottleneck is that a prompt is a piece of software whose behavior nobody verifies before it reaches users.

That reframing is the real argument behind treating prompt work as a testing problem rather than a writing problem. A prompt that produces a clean answer in a developer's console is not a working feature. It is an untested function with one passing manual case. The difference between a demo and a product is the same difference it has always been in software: coverage, regression detection, and the discipline to catch failures before customers do.

Why prompts resist traditional testing

Unit tests assume determinism. Call a function with the same input, get the same output, assert against it. Language models break that assumption at the foundation. The same prompt with the same parameters can return different text on consecutive calls. Temperature settings, model version updates, and provider-side changes all shift behavior without a single line of your code changing.

This is why so many teams skip prompt testing entirely. The familiar tools do not map cleanly. An exact-match assertion against model output fails constantly, not because the feature is broken but because the output varies. So engineers fall back to eyeballing results during development and hoping production looks similar.

The gap shows up as a category of bug that is hard to even describe in a ticket. A summarization feature starts dropping key facts after a model upgrade. A classification prompt that was 94 percent accurate quietly drops to 80 percent when an edge case becomes common. A customer-facing assistant begins refusing reasonable requests because an upstream safety adjustment changed how it interprets borderline phrasing. None of these announce themselves. They erode quality until someone notices the metrics moved.

What testing prompts actually looks like

The practical answer borrows from how teams already test other non-deterministic or fuzzy systems. Instead of asserting exact strings, you assert properties of the output.

For a classification prompt, the test is straightforward and looks almost like ordinary unit testing. You build a labeled dataset of inputs and expected categories, run the prompt against all of them, and measure accuracy. The assertion is not "output equals X" but "accuracy across this set stays above a threshold." When a model update drops you below the line, the test fails and you know before shipping.
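A minimal sketch of that threshold test, under stated assumptions: `fake_classify` is a hypothetical keyword rule standing in for the real model call, and the four-row dataset and the 0.9 threshold are made up for illustration.

```python
from typing import Callable, Sequence


def accuracy(classify: Callable[[str], str],
             dataset: Sequence[tuple[str, str]]) -> float:
    """Fraction of labeled examples the classifier gets right."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(1 for text, expected in dataset if classify(text) == expected)
    return correct / len(dataset)


def fake_classify(text: str) -> str:
    # Hypothetical stand-in for the prompt plus model call.
    return "billing" if "invoice" in text.lower() else "support"


# Illustrative labeled set; a real one would be far larger.
LABELED = [
    ("Where is my invoice?", "billing"),
    ("The app crashes on login", "support"),
    ("Invoice total looks wrong", "billing"),
    ("How do I reset my password?", "support"),
]

THRESHOLD = 0.9  # assumed value; choose one the feature can tolerate


def test_accuracy_stays_above_threshold() -> None:
    # The assertion is on aggregate accuracy, not on any single output.
    assert accuracy(fake_classify, LABELED) >= THRESHOLD
```

Swapping `fake_classify` for a function that calls the real model leaves the test itself unchanged, which is the point: the dataset and threshold are the versioned artifact.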

For open-ended generation, properties replace exact matches. A summary should be shorter than its source, should not introduce names absent from the input, and should stay under a length ceiling. Each of those is a checkable assertion. You are not testing whether the model wrote the summary you would have written. You are testing whether it stayed inside the boundaries the feature requires.
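Those three properties could be checked roughly as follows. This is a sketch, not a robust implementation: the name check is a crude capitalized-word heuristic, and the function name, the 280-character ceiling, and the sample texts are all assumptions for illustration.

```python
import re


def capitalized_words(text: str) -> set[str]:
    """Capitalized tokens, used here as a rough proxy for names."""
    return set(re.findall(r"\b[A-Z][a-z]+\b", text))


def summary_violations(source: str, summary: str, max_chars: int = 280) -> list[str]:
    """Return a list of violated properties; an empty list means all passed."""
    problems = []
    if len(summary) >= len(source):
        problems.append("summary is not shorter than source")
    if len(summary) > max_chars:
        problems.append(f"summary exceeds {max_chars} characters")
    # Compare case-insensitively so a sentence-initial "Revenue" matches "revenue".
    source_words = {w.lower() for w in re.findall(r"[A-Za-z]+", source)}
    invented = sorted(
        w for w in capitalized_words(summary) if w.lower() not in source_words
    )
    if invented:
        problems.append("capitalized words absent from source: " + ", ".join(invented))
    return problems


SOURCE = (
    "Alice Chen told the board on Monday that revenue rose eight percent, "
    "driven mostly by renewals in the enterprise segment."
)
GOOD = "Revenue rose eight percent, Alice Chen told the board."
BAD = "Bob Smith said revenue rose eight percent."

assert summary_violations(SOURCE, GOOD) == []
assert summary_violations(SOURCE, BAD) == [
    "capitalized words absent from source: Bob, Smith"
]
```

The heuristic will flag any capitalized word missing from the source, including ordinary sentence openers, so a production check would likely need a proper entity extractor.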

The harder cases use a model to grade another model's output, often called LLM-as-judge. A grading prompt scores responses for relevance, tone, or factual consistency against a reference. This introduces its own reliability question, since the judge is itself non-deterministic, but it scales evaluation across cases where no simple property captures quality. The judge prompt then needs its own tests, which is recursive but not absurd. You validate the grader against human-labeled examples, trust it within known limits, and revalidate when the judge model or its prompt changes.
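Validating a judge can be as simple as measuring how often its verdicts match human labels. The sketch below assumes a pass/fail judge; `fake_judge` is a hypothetical stand-in for a real grading prompt, and the labeled examples are invented.

```python
from typing import Callable, Sequence


def judge_agreement(judge: Callable[[str, str], bool],
                    labeled: Sequence[tuple[str, str, bool]]) -> float:
    """Share of examples where the judge's verdict matches the human label."""
    if not labeled:
        raise ValueError("labeled set must not be empty")
    matches = sum(
        1 for question, answer, human in labeled if judge(question, answer) == human
    )
    return matches / len(labeled)


def fake_judge(question: str, answer: str) -> bool:
    # Hypothetical stand-in: passes anything that is not an explicit non-answer.
    return "don't know" not in answer.lower()


# (question, answer, human verdict) triples; illustrative only.
HUMAN_LABELED = [
    ("What is the refund window?", "Refunds are accepted within 30 days.", True),
    ("What is the refund window?", "I don't know.", False),
    ("How do I reset my password?", "Use the reset link on the login page.", True),
    ("How do I reset my password?", "Our office is closed on Sundays.", False),
]

agreement = judge_agreement(fake_judge, HUMAN_LABELED)
assert agreement == 0.75  # the judge misses the off-topic answer
```

The measured agreement rate is the "known limit": a judge that matches humans three times out of four should not be gating releases on its own.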

Tooling has grown up around this. Promptfoo runs prompts against test cases and assertions from a config file, much like a test runner. DeepEval brings an assertion library styled after pytest. OpenAI Evals provides a framework for building and running evaluation suites. The common thread is moving evaluation out of the developer's head and into a repeatable, version-controlled artifact.

Integration testing for systems that think

Unit-level prompt tests catch a single prompt drifting. They miss the failures that emerge when prompts chain together. A retrieval step feeds a generation step, which feeds a formatting step. Each component can pass its own tests while the assembled pipeline produces nonsense, because the retrieval returned plausible-but-wrong context that the generator faithfully built on.

This is the integration testing problem applied to AI systems, and agent architectures make it sharper. An agent that calls tools, reads results, and decides its next action has a branching space of behaviors that no fixed set of unit tests covers. Testing here means running end-to-end scenarios and asserting on the final outcome and the trajectory taken, not just the last message. Did the agent call the right tool? Did it recover when the tool returned an error? Did it stop instead of looping?
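One way to make those three questions checkable is to record each run as a list of steps and assert on the list. The sketch below assumes a simple recorded-trajectory format; the `Step` fields and the tool names are hypothetical, not taken from any agent framework.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Step:
    tool: str   # name of the tool the agent called
    args: str   # serialized arguments
    ok: bool    # whether the tool call succeeded


def called_tool(trajectory: Sequence[Step], tool: str) -> bool:
    """Did the agent call the given tool at any point?"""
    return any(step.tool == tool for step in trajectory)


def recovered_from_errors(trajectory: Sequence[Step]) -> bool:
    """True if no step failed, or if some step succeeded after the last failure."""
    error_positions = [i for i, step in enumerate(trajectory) if not step.ok]
    if not error_positions:
        return True
    return any(step.ok for step in trajectory[error_positions[-1] + 1:])


def max_repeats(trajectory: Sequence[Step]) -> int:
    """Longest run of consecutive identical (tool, args) calls."""
    longest = run = 0
    previous = None
    for step in trajectory:
        key = (step.tool, step.args)
        run = run + 1 if key == previous else 1
        longest = max(longest, run)
        previous = key
    return longest


# A recorded scenario: one failed lookup, a retry, then the refund.
RECORDED = [
    Step("search_orders", "id=123", ok=False),
    Step("search_orders", "id=123", ok=True),
    Step("issue_refund", "id=123", ok=True),
]

assert called_tool(RECORDED, "issue_refund")
assert recovered_from_errors(RECORDED)
assert max_repeats(RECORDED) <= 3  # assumed loop budget
```

The loop budget of three identical consecutive calls is arbitrary; the useful part is that looping becomes a failing assertion instead of a surprise on the bill.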

Teams that get this right tend to maintain a growing suite of recorded scenarios, each one often born from a production incident. A user hit a failure, the team reproduced it as a test case, and now that case runs on every change. It is the same regression-suite hygiene that mature software teams have practiced for decades, applied to a stack that happens to involve probability distributions.

The cost question nobody mentions first

Running a thorough evaluation suite against a frontier model is not free. A test set of a few thousand cases, run on every commit, against a premium model, adds up to real spend. That changes the economics of CI. A traditional unit-test suite costs only compute time you already own; an evaluation suite adds a metered API bill on top.
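The arithmetic is simple enough to keep next to the suite. Every number below is a made-up placeholder, not a real price or a measured workload; substitute your own case count, token usage, and provider rate.

```python
def suite_cost_usd(cases: int, tokens_per_case: int,
                   usd_per_million_tokens: float) -> float:
    """Estimated cost of one full run of the evaluation suite."""
    total_tokens = cases * tokens_per_case
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical inputs: 2,000 cases, 1,500 tokens each, $10 per million tokens.
per_run = suite_cost_usd(cases=2_000, tokens_per_case=1_500,
                         usd_per_million_tokens=10.0)
per_day = per_run * 40  # assuming 40 commits a day

print(f"per run: ${per_run:.2f}, per day: ${per_day:.2f}")
# prints "per run: $30.00, per day: $1200.00"
```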

The responses are pragmatic. Tier the tests: a small fast suite of cheaper-model checks on every commit, the full expensive suite nightly or before release. Cache results when the prompt and inputs have not changed. Use smaller models as a first-pass filter and reserve the expensive judge for cases that need it. The point is that prompt testing introduces a budget line, and pretending otherwise leads to suites that get disabled the first time someone audits the bill.
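The caching idea can be sketched as a wrapper that keys results on a hash of everything that determines the call. Including the model name in the key is an added assumption here, and `CachedRunner` and `call_model` are hypothetical names, not part of any evaluation library.

```python
import hashlib
import json
from typing import Callable


def cache_key(model: str, prompt: str, test_input: str) -> str:
    """Stable hash of everything that should invalidate a cached result."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "input": test_input}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedRunner:
    """Calls the model only when the (model, prompt, input) triple is new."""

    def __init__(self, call_model: Callable[[str, str, str], str]) -> None:
        self._call_model = call_model
        self._cache: dict[str, str] = {}
        self.model_calls = 0  # counts real (billable) calls

    def run(self, model: str, prompt: str, test_input: str) -> str:
        key = cache_key(model, prompt, test_input)
        if key not in self._cache:
            self.model_calls += 1
            self._cache[key] = self._call_model(model, prompt, test_input)
        return self._cache[key]
```

A cache like this freezes one sample of a non-deterministic output, so it suits the fast per-commit tier better than the full suite, and it will not notice provider-side changes behind an unchanged model name.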

What this means for how teams build

The skeptical read on "prompt engineering" as a job title was always that it described a temporary artifact of immature tooling. As models improve at following instructions, the marginal value of clever wording shrinks. What does not shrink is the value of knowing whether your system works and noticing immediately when it stops.

That is testing, and it is unglamorous in exactly the way good engineering tends to be. The teams shipping reliable AI features are not the ones with the most elegant prompts. They are the ones who can change a model, a parameter, or a prompt and get a clear answer about whether quality held. The discipline did not need a new name. It needed practitioners to recognize that the thing they built was software, and software gets tested.

The interesting consequence for the next wave of AI tooling startups is that evaluation infrastructure, not prompt authoring, is where the durable problems live. Anyone building in this space is betting on the same observation: the wording was never the hard part.