As AI agents move beyond simple prototypes to handle complex enterprise workflows, the limits of prompt-centric design are forcing a shift in how teams build reliable systems. This analysis argues that scalable, trustworthy agents require deterministic control flow encoded in software, paired with aggressive programmatic verification, rather than increasingly elaborate prompt chains that break down as task complexity grows.
In a May 7, 2026 analysis, Brian argues that the push to build autonomous AI agents that handle complex, multi-step tasks has exposed a hard limit in how most teams approach agent design. For the past three years, the dominant pattern has been prompt-centric: chain together increasingly elaborate prompts, add all-caps directives to force compliance, and hope the large language model (LLM) follows the intended path. By May 2026, this approach is failing at scale. Teams building agents for enterprise workflows, from invoice processing to customer support, report that prompt chains break down as task complexity grows, with silent errors and non-deterministic outputs making systems impossible to debug or trust.
This analysis argues that the path to functional, scalable agents is not better prompt engineering, but a shift to deterministic control flow encoded in software. Reliable agents tackling complex tasks need explicit state transitions, validation checkpoints, and error handling written in code, with the LLM treated as a single component rather than the entire system.
Prompt chains suffer from fundamental properties that make them unsuitable for complex workflows:
- They are non-deterministic: even with temperature set to 0, LLM outputs can vary between runs, making it impossible to guarantee consistent behavior.
- They are weakly specified: a prompt like "summarize this 50-page report, then extract all action items, then send a Slack message to the project lead" does not define how to handle a report that is scanned poorly, an action item that is ambiguous, or a Slack API failure.
- They resist local reasoning: in a software system, you can isolate a single function, unit test it, and verify it works as intended. Testing a single step of a prompt chain requires running the entire chain, and even then the output is probabilistic.
- They do not compose the way software libraries do: if you build a 10-step prompt chain for invoice processing, you cannot reuse step 3 (extract line items) in a separate tax calculation workflow without rewriting the entire prompt, and you cannot verify that step 3 works in isolation.
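To make the composition point concrete, here is a minimal sketch of one chain step written as an ordinary function. The `call_llm` stub and the prompt wording are illustrative assumptions, not any specific provider's API.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; swap in your provider's SDK call."""
    raise NotImplementedError

def extract_line_items(invoice_text: str, llm=call_llm) -> list[dict]:
    """One chain step as a plain function: testable alone, reusable anywhere."""
    prompt = (
        'Extract line items as a JSON array of {"description": str, '
        '"amount": float} objects from this invoice:\n' + invoice_text
    )
    items = json.loads(llm(prompt))  # fails loudly on malformed output
    if not all({"description", "amount"} <= set(item) for item in items):
        raise ValueError("line item is missing required fields")
    return items

def test_extract_line_items():
    """Unit test with a stubbed model; no full-chain run required."""
    fake = lambda _prompt: '[{"description": "Widget", "amount": 100.0}]'
    assert extract_line_items("ACME invoice ...", llm=fake)[0]["amount"] == 100.0
```

Because the step is now an ordinary function, the same extractor can be dropped into a separate tax workflow and tested against a stubbed model, which is exactly the composition and local reasoning a monolithic prompt chain cannot offer.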
Most teams realize they have hit the ceiling of prompting when they start adding directives like "MANDATORY", "DO NOT SKIP", or "YOU MUST FOLLOW THESE INSTRUCTIONS EXACTLY" to their system prompts. These are signs that you are trying to force deterministic behavior out of a non-deterministic system, a battle that only gets harder as complexity grows. Reasoning about a system becomes impossible when statements are suggestions, and functions can return "Success" while hallucinating their outputs. This is the state of most prompt-centric agent designs today.
The alternative is to build deterministic scaffolds for agents. This means writing control flow in code: explicit state transitions (if the invoice total does not match the sum of line items, retry extraction), validation checkpoints (verify that the LLM output is valid JSON that matches a predefined schema), and retry logic (if an API call fails, wait 5 seconds and retry up to 3 times). Frameworks like LangGraph and workflow orchestration platforms like Temporal are designed for exactly this use case, letting teams define agent workflows as state graphs or durable workflows, with the LLM called only for tasks that require natural language understanding or generation.
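The pattern does not require a framework to get started; a few dozen lines of plain Python capture it. In this minimal sketch, `call_llm`, `Escalate`, `with_retries`, and `validated_llm_step` are illustrative names of my own, not any framework's API; the retry parameters mirror the numbers in the paragraph above.

```python
import json
import time

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; swap in your provider's SDK call."""
    raise NotImplementedError

class Escalate(Exception):
    """Raised when automated handling is exhausted and a human must step in."""

def with_retries(fn, attempts: int = 3, delay_s: float = 5.0):
    """Bounded retry for flaky side effects such as API calls."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_s)

def validated_llm_step(prompt: str, validate, max_retries: int = 2) -> dict:
    """Call the LLM, check its output in code, and feed errors back on retry."""
    feedback = ""
    for _ in range(max_retries + 1):
        raw = call_llm(prompt + feedback)
        try:
            data = json.loads(raw)
            validate(data)  # deterministic checkpoint written in code
            return data
        except (json.JSONDecodeError, ValueError) as err:
            feedback = f"\nPrevious output was rejected: {err}. Correct it."
    raise Escalate("LLM output failed validation after retries")
```

Feeding the validator's specific error back into the retry prompt is what turns a generic retry into a targeted correction; the invoice example below uses this same pattern.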
Consider a common enterprise agent use case: processing vendor invoices. A prompt-centric version might use a single prompt that tells the LLM to "read the invoice attached to this email, extract the total, line items, and vendor name, then upload the data to the accounting system". This has multiple failure modes: the LLM might misread the total, skip a line item, or hallucinate a vendor name. If the upload to the accounting system fails, the LLM might not handle the error, and the team would only find out when the vendor complains about non-payment.
A control flow-centric version handles this entirely differently. The workflow is defined in code (a sketch follows the list):
- Use the Gmail API to fetch the latest unprocessed invoice email (code, no LLM involved).
- Download the attachment and extract text using OCR (code, or a small LLM call for low-quality scans).
- Call the LLM with a prompt that specifies a strict JSON schema for line items, total, and vendor name (LLM handles the extraction task it is good at).
- Validate the LLM output: check that the sum of line items matches the total, that the vendor name exists in the approved vendor list, and that the output is valid JSON (code check, no LLM needed).
- If validation fails, retry the LLM call with the specific error (e.g., "Sum of line items is $100, but total is $120, please re-extract") up to 2 times.
- If retries fail, escalate to a human accountant (code handles escalation).
- If validation passes, use the accounting system API to upload the invoice data (code, with error handling for API failures).
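Here is a rough Python skeleton of these seven steps, reusing `validated_llm_step`, `with_retries`, and `Escalate` from the earlier sketch. `fetch_invoice_email`, `ocr`, `notify_accountant`, and `upload_invoice` stand in for real integrations; they are assumptions for illustration, not actual library calls.

```python
# Steps 1-7 above as explicit control flow. The integration helpers
# (fetch_invoice_email, ocr, notify_accountant, upload_invoice) are
# hypothetical placeholders, assumed to be defined elsewhere.

EXTRACTION_PROMPT = (
    'Return JSON: {"vendor": str, "total": float, "line_items": '
    '[{"description": str, "amount": float}]}. Invoice text:\n'
)

def validate_invoice(data: dict, approved_vendors: set[str]) -> None:
    """Step 4: deterministic checks, no LLM involved."""
    if data["vendor"] not in approved_vendors:
        raise ValueError(f"vendor {data['vendor']!r} is not approved")
    line_sum = round(sum(i["amount"] for i in data["line_items"]), 2)
    if line_sum != round(float(data["total"]), 2):
        raise ValueError(f"line items sum to {line_sum}, total is {data['total']}")

def process_next_invoice(approved_vendors: set[str]) -> None:
    email = fetch_invoice_email()               # step 1: Gmail API, code only
    text = ocr(email.attachment)                # step 2: OCR, code only
    try:
        data = validated_llm_step(              # steps 3-5: LLM plus checks
            EXTRACTION_PROMPT + text,
            validate=lambda d: validate_invoice(d, approved_vendors),
            max_retries=2,
        )
    except Escalate:
        notify_accountant(email)                # step 6: human escalation
        return
    with_retries(lambda: upload_invoice(data))  # step 7: upload with retries
```

Every branch is explicit: a mismatch retries with the specific error, exhausted retries escalate to a human, and the upload has its own bounded retry, so a failure always surfaces at one identifiable step.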
Every step of this workflow is logged, errors are caught immediately, and the team can trace exactly where a failure occurred. This is the difference between a prototype that works 70% of the time and a production system that works 99.9% of the time.
Deterministic orchestration is only half the battle. LLMs are prone to silent failures: they can return plausible-sounding but incorrect outputs that pass basic checks. An agent without aggressive programmatic verification is just a fast way to propagate errors. The original analysis of agent design points to three common, flawed approaches to handling this risk:
- Babysitter: Keep a human in the loop for every step of the agent workflow. This works for low-volume prototypes but does not scale to enterprise use cases where agents process thousands of tasks per day.
- Auditor: Perform exhaustive end-to-end verification of all agent outputs after the workflow completes. This catches errors, but only after they have already propagated, potentially causing downstream damage (e.g., sending an incorrect refund to a customer).
- Prayer: Accept agent outputs without verification, trusting that the LLM got it right. This is only viable for non-critical tasks, and even then, the risk of costly errors is high.
The alternative is to build verification into every step of the control flow. For the invoice example, this means checking the sum of line items against the total, verifying the vendor is approved, and validating the JSON schema before passing data to the accounting system. For a customer support agent, it means checking that a generated response does not contain sensitive customer information, and that a refund request is actually eligible before processing it.
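As a rough sketch of what those per-step checks might look like for the support case, the snippet below uses simple pattern matching and a minimal order record; the PII patterns and field names are illustrative assumptions, not a complete policy.

```python
import re

# Illustrative patterns only; a real deployment needs a vetted PII policy.
PII_PATTERNS = [
    re.compile(r"\b\d{16}\b"),             # bare card-like numbers
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN format
]

def check_response_safe(draft: str) -> None:
    """Reject a generated reply before it is sent, not after."""
    if any(p.search(draft) for p in PII_PATTERNS):
        raise ValueError("draft response contains sensitive data")

def check_refund_eligible(order: dict, amount: float) -> None:
    """Verify eligibility in code before the refund side effect runs."""
    if amount > order["paid_amount"]:
        raise ValueError("refund exceeds amount paid")
    if order["status"] != "delivered":
        raise ValueError("order is not in a refundable state")
```

Both checks run in code before any side effect executes, so an ineligible refund is blocked up front rather than discovered by an after-the-fact audit.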
This shift from prompt-centric to control flow-centric agent design mirrors earlier shifts in software engineering. Early web development relied on messy, inline scripts with no clear structure; the move to frameworks with explicit routing, state management, and error handling made web applications scalable and reliable. Agent engineering is undergoing the same transition. Prompt engineering is still a valuable skill, but it is only one part of building a reliable agent. The core logic of the agent must live in code, not in prose.
There are trade-offs to this approach. Building deterministic control flow requires more upfront engineering time, and teams need software engineering expertise in addition to prompt engineering skills. For simple, narrow tasks (e.g., summarizing a single document), prompt chains are still faster to prototype and deploy. But for any agent that handles multi-step workflows, processes sensitive data, or needs to run at scale, the investment in control flow pays off in reliability and debuggability.
Early autonomous agents like AutoGPT relied almost entirely on prompt chains, and while they were impressive demos, they rarely made it to production for complex tasks. Newer agent frameworks are moving toward control flow-first design, recognizing that the LLM is a tool, not the system itself. The OpenAI Prompt Engineering Guide still recommends clear, specific prompts, but notes that for complex workflows, combining prompts with code-based control flow is necessary for reliability.
The core takeaway is simple: if your agent relies on all-caps prompts to function, you have outgrown prompt-centric design. The path to reliable agents is not more elaborate prompts, but more deliberate software engineering. Encode your control flow in code, treat the LLM as a component, and build verification into every step. This is how agents move from demos to production systems that teams can trust.