OpenAI and Thrive Holdings built a tax‑preparation assistant that uses Codex to turn practitioner corrections into structured evaluation targets, enabling a loop that improves extraction accuracy over weeks. The system saved about a third of accountants’ time and raised the share of returns with ≥75 % correct fields from 25 % to 86 % in six weeks, but the gains rely on heavy engineering scaffolding and still need human oversight for ambiguous cases.
Self‑Improving Tax Agents with Codex: What the Pilot Actually Achieved

TL;DR – A joint effort between OpenAI and Thrive Holdings produced a tax‑return assistant (Tax AI) that uses Codex to automate parts of the 1040/1041 filing workflow for a network of 30+ accounting firms. By capturing practitioner corrections, turning them into evaluation datasets, and feeding those to Codex‑driven code‑generation tasks, the system lifted the fraction of returns with at least 75 % field‑level correctness from 25 % to 86 % in six weeks. The approach shows how production traces can be reused for continuous improvement, but it still depends on a sizeable engineering pipeline and cannot replace human review for edge cases.
What the announcement claims
- Tax AI processes roughly 7,000 returns per season, automating data entry and field extraction.
- Practitioners spend about a third less time per return.
- Field‑level accuracy reaches 97 % on the best‑case returns; 86 % of returns achieve at least 75 % correct fields after six weeks of operation.
- The improvement loop consists of three pillars: practitioner feedback, production traces, and a Codex‑driven iteration pipeline.
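The headline metric combines two numbers: a per-return field-level accuracy, and the share of returns whose accuracy clears a threshold. The announcement does not spell out how a field is scored, so the sketch below assumes exact string match against the filed return. The function names and toy data are hypothetical.

```python
from typing import Dict, List


def field_accuracy(extracted: Dict[str, str], filed: Dict[str, str]) -> float:
    """Fraction of filed fields that the extraction matched exactly (assumed scoring rule)."""
    if not filed:
        return 0.0
    correct = sum(1 for name, value in filed.items() if extracted.get(name) == value)
    return correct / len(filed)


def share_at_or_above(accuracies: List[float], threshold: float = 0.75) -> float:
    """Share of returns whose field accuracy is at least `threshold`."""
    if not accuracies:
        return 0.0
    return sum(1 for a in accuracies if a >= threshold) / len(accuracies)


# Toy data: (extracted fields, fields on the filed return). Not real results.
returns = [
    ({"wages": "52000", "interest": "120", "fair_rental_days": "365", "state": "NY"},
     {"wages": "52000", "interest": "120", "fair_rental_days": "365", "state": "NY"}),
    ({"wages": "61000", "interest": "75", "state": "CA"},
     {"wages": "61000", "interest": "75", "fair_rental_days": "200", "state": "CA"}),
    ({"wages": "4800", "interest": "10"},
     {"wages": "48000", "interest": "10", "fair_rental_days": "90", "state": "TX"}),
]

accuracies = [field_accuracy(extracted, filed) for extracted, filed in returns]
share = share_at_or_above(accuracies)
print(accuracies, share)
```

Read this way, the "86 % of returns" figure is a statement about the distribution of per-return accuracy, not about the average accuracy across all fields.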
What is actually new
1. Structured production traces
The system records the full path from raw source documents (handwritten notes, PDFs, spreadsheets) through extraction, provenance tagging, mapping to the tax engine, and the final filed return. This trace is stored alongside the practitioner’s correction, allowing the team to pinpoint whether a mismatch stems from extraction, mapping, or a legitimate judgment.
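A minimal sketch of what such a trace record might hold, and how it could be used to localise a mismatch. The field names, the `Stage` enum, and the substring heuristic are all assumptions for illustration; the post does not describe the actual schema or classification logic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    EXTRACTION = "extraction"
    MAPPING = "mapping"
    JUDGMENT = "judgment"


@dataclass(frozen=True)
class FieldTrace:
    field_name: str
    source_ref: str                  # provenance: document and location (hypothetical format)
    source_text: str                 # raw text the extractor saw
    extracted_value: Optional[str]   # output of the extraction step
    mapped_value: Optional[str]      # value sent to the tax engine
    filed_value: str                 # value on the return after practitioner review


def locate_mismatch(trace: FieldTrace) -> Optional[Stage]:
    """Crude heuristic: name the first stage that lost the filed value."""
    if trace.mapped_value == trace.filed_value:
        return None  # nothing was corrected
    if trace.extracted_value == trace.filed_value:
        return Stage.MAPPING  # extraction was right, mapping dropped or changed it
    if trace.filed_value in trace.source_text:
        return Stage.EXTRACTION  # the value was in the document but was not extracted
    return Stage.JUDGMENT  # practitioner supplied a value not literally in the source


missed = FieldTrace("fair_rental_days", "rental_summary.pdf#p2",
                    "Property A rented 214 days", None, None, "214")
dropped = FieldTrace("fair_rental_days", "rental_summary.pdf#p2",
                     "Property A rented 214 days", "214", None, "214")
adjusted = FieldTrace("home_office_sqft", "notes.pdf#p1",
                      "Home office, approx. one room", "150", "150", "120")

print(locate_mismatch(missed), locate_mismatch(dropped), locate_mismatch(adjusted))
```

A production system would need something more robust than a substring check, but the point stands: without the intermediate values, the three cases look identical from the outside.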
2. Automated eval generation
Corrections are grouped into recurring patterns (e.g., missed “fair‑rental‑days” fields). Those patterns become concrete evaluation suites stored in a version‑controlled repo. The evals contain:
- Representative source packages.
- Expected field values.
- Regression checks to guard against accidental breakage.
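One way to picture the grouping step: bucket corrections by the field they touch, and promote a bucket to an eval suite once it recurs often enough. This is a sketch under assumptions. `Correction`, `EvalCase`, `build_suites`, and the `min_cases` cutoff are hypothetical names, and the real pipeline presumably groups on richer patterns than the field name alone.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Correction:
    return_id: str
    field_name: str
    source_package: str   # path to the representative source documents (hypothetical)
    expected_value: str   # value the practitioner entered


@dataclass
class EvalCase:
    source_package: str
    expected: Dict[str, str]


def build_suites(corrections: List[Correction], min_cases: int = 2) -> Dict[str, List[EvalCase]]:
    """Group corrections by field; keep only groups that recur at least `min_cases` times."""
    grouped: Dict[str, List[EvalCase]] = defaultdict(list)
    for c in corrections:
        grouped[c.field_name].append(
            EvalCase(c.source_package, {c.field_name: c.expected_value})
        )
    return {name: cases for name, cases in grouped.items() if len(cases) >= min_cases}


corrections = [
    Correction("r1", "fair_rental_days", "packages/r1", "214"),
    Correction("r2", "fair_rental_days", "packages/r2", "365"),
    Correction("r3", "state_refund", "packages/r3", "310"),
]
suites = build_suites(corrections)
print(sorted(suites))
```

One-off corrections stay out of the suite until they recur, which keeps the eval set focused on patterns rather than individual judgment calls.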
3. Codex‑driven task execution
Given a bounded task description, Codex inspects the trace, proposes code changes (e.g., extending an extraction schema), runs the targeted eval, and opens a pull request for human review. The PR is eligible to merge only when it passes both the new eval and the regression suite.
The rental‑property example in the blog post illustrates the full cycle:
- A practitioner corrects a missed rental‑day count.
- The system records the discrepancy and groups it with similar corrections.
- An eval for “fair‑rental‑days” is created.
- Codex generates a fix, validates it, and suggests a PR.
- After approval, the fix ships and new production data provide fresh traces.
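The gate at the centre of this cycle can be sketched as follows: a candidate change goes to human review only if it passes the new targeted eval and the existing regression suite. The regex extractors below are toy stand-ins, not the real Tax AI extraction code, and every name is hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

Extractor = Callable[[str], Dict[str, str]]
Case = Tuple[str, Dict[str, str]]  # (source text, expected field values)


def passes(extractor: Extractor, cases: List[Case]) -> bool:
    """True if every expected field in every case is reproduced exactly."""
    return all(
        all(extractor(src).get(name) == value for name, value in expected.items())
        for src, expected in cases
    )


def ready_for_review(candidate: Extractor, targeted: List[Case], regression: List[Case]) -> bool:
    """A change is proposed as a PR only if it fixes the new case and breaks nothing old."""
    return passes(candidate, targeted) and passes(candidate, regression)


def baseline(src: str) -> Dict[str, str]:
    m = re.search(r"Wages:\s*(\d+)", src)
    return {"wages": m.group(1)} if m else {}


def candidate(src: str) -> Dict[str, str]:
    out = baseline(src)  # keep existing behaviour, then add the new field
    m = re.search(r"Fair rental days:\s*(\d+)", src)
    if m:
        out["fair_rental_days"] = m.group(1)
    return out


def careless_candidate(src: str) -> Dict[str, str]:
    m = re.search(r"Fair rental days:\s*(\d+)", src)
    return {"fair_rental_days": m.group(1)} if m else {}  # silently drops wages


targeted = [("Fair rental days: 214\nWages: 52000", {"fair_rental_days": "214"})]
regression = [("Wages: 52000", {"wages": "52000"})]

print(ready_for_review(baseline, targeted, regression))
print(ready_for_review(candidate, targeted, regression))
print(ready_for_review(careless_candidate, targeted, regression))
```

The careless variant fixes the rental-days case but fails the regression suite, which is exactly the kind of change the second check exists to catch.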
Limitations and open questions
- Engineering overhead – Setting up the trace infrastructure, eval generation, and Codex task environments required a dedicated team of engineers and researchers. Smaller firms without such resources may find the barrier too high.
- Human‑in‑the‑loop – Ambiguous corrections still fall back to engineers. The loop is not fully autonomous; it accelerates iteration but does not eliminate the need for expert review.
- Scope of automation – The current loop focuses on the extraction‑and‑mapping layer. Complex tax judgments, audit‑level reasoning, or cross‑form reconciliations remain manual.
- Benchmarking – The reported 97 % accuracy applies to “best‑case” returns; the distribution across form types (W‑2, K‑1, Schedule E, etc.) is not disclosed, making it hard to compare against existing commercial tax software.
- Generalisation – While the authors claim the pattern can be reused for Schedule C, Schedule A, and even non‑tax domains, each new area required weeks of engineering effort to define schemas, evals, and Codex task templates.
How it fits into the broader effort on self‑improving agents
The Tax AI project builds on OpenAI’s earlier work on harness engineering and the Symphony framework, which formalise how to present a problem to Codex with scoped context and validation steps. Those write‑ups (see the OpenAI blog) describe a repeatable recipe:
- Make the task legible – expose inputs, outputs, and intermediate artifacts.
- Provide scoped tools – a read‑only view of production traces plus a writable worktree.
- Validate automatically – run targeted evals and regression suites before any code lands.
Tax AI demonstrates the recipe in a high‑stakes, regulated domain. It shows that, when the right signals are captured, Codex can move from “write code from a description” to “debug a production failure from trace data.”
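As a rough illustration of "scoped tools", a task can carry an explicit list of read-only and writable locations, and a proposed change can be checked against that list before any eval runs. The `TaskSpec` structure, the paths, and the check names are assumptions for this sketch; the post does not document the actual task format.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Tuple


@dataclass(frozen=True)
class TaskSpec:
    description: str
    read_only_roots: Tuple[str, ...]   # e.g. production traces
    writable_roots: Tuple[str, ...]    # the only code surface the agent may edit
    required_checks: Tuple[str, ...]   # evals that must pass before review


def is_within(path: str, root: str) -> bool:
    """True if `path` is `root` or lies beneath it; paths containing '..' are rejected."""
    p, r = PurePosixPath(path), PurePosixPath(root)
    if ".." in p.parts:
        return False
    return p == r or r in p.parents


def out_of_scope(spec: TaskSpec, changed_paths: List[str]) -> List[str]:
    """Return the changed paths that fall outside every writable root."""
    return [
        path for path in changed_paths
        if not any(is_within(path, root) for root in spec.writable_roots)
    ]


spec = TaskSpec(
    description="Extract fair rental days from rental summaries",
    read_only_roots=("traces",),
    writable_roots=("schemas/extraction",),
    required_checks=("evals/fair_rental_days", "evals/regression"),
)
changed = [
    "schemas/extraction/schedule_e.py",
    "traces/2024/r1.json",
    "schemas/extraction/../../ci.yml",
]
print(out_of_scope(spec, changed))
```

A check like this is cheap to run first: a change that touches the traces or the CI configuration can be rejected without spending any eval time on it.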
Practical takeaways for builders
- Invest in traceability – Capturing provenance at field level pays off; without it, you cannot reliably turn a correction into a reproducible test.
- Automate eval creation – Grouping similar failures and exporting them as eval suites is the bridge between raw feedback and a Codex task.
- Keep the loop bounded – Restrict Codex to a well‑defined code surface (e.g., extraction schemas) and let humans review the PR. This reduces the risk of silent regressions.
- Expect a non‑trivial engineering effort – The loop is not a plug‑and‑play library; it needs custom tooling, CI pipelines, and domain‑specific schemas.
Where to learn more
- The full technical write‑up on OpenAI’s blog: Building self‑improving tax agents with Codex
- OpenAI’s harness engineering guide: https://github.com/openai/harness-engineering
- The repo that hosts the Tax AI evaluation framework (private for now, but a stripped‑down open‑source version is planned for release).
Bottom line – The Tax AI pilot validates that a disciplined production‑trace pipeline, combined with Codex‑generated fixes, can accelerate accuracy gains in a complex, document‑heavy domain. The gains are real, but the approach still demands substantial engineering scaffolding and continuous human oversight. Smaller teams may adopt the pattern incrementally, starting with trace capture and eval generation, before bringing Codex into the loop.
