A new benchmark, DELEGATE-52, shows that leading language models introduce errors in up to a quarter of document content during long‑term delegated tasks, raising concerns about their reliability as autonomous assistants.
LLMs Corrupt Your Documents When You Delegate
Paper: LLMs Corrupt Your Documents When You Delegate – Philippe Laban, Tobias Schnabel, Jennifer Neville (arXiv:2604.15597, submitted 17 Apr 2026)
The problem: trusting LLMs with document‑heavy work
Large language models are increasingly being used as delegates—software agents that take over parts of a professional’s workflow, from writing code to editing scientific manuscripts. The promise is simple: hand the model a task, let it act, and get a finished product without having to micromanage each step.
In practice, that promise hinges on a single assumption: the model will preserve the integrity of the original document while making the requested changes. If the model silently introduces mistakes, the downstream impact can be severe, especially when the output feeds into regulatory filings, software releases, or academic publications.
Introducing DELEGATE-52
To test that assumption, the authors built DELEGATE-52, a benchmark that simulates long‑term delegated workflows across 52 professional domains, including:
- Software development (vibe coding, refactoring, dependency updates)
- Crystallography data preparation
- Music notation editing
- Legal contract revision
- Scientific manuscript formatting
Each scenario starts with a clean source document, then issues a sequence of realistic edit instructions. The model must apply each instruction while keeping the rest of the file unchanged. The benchmark records the cumulative error rate after the full instruction chain.
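The paper's harness isn't reproduced in this article, but the control flow it describes is easy to picture. Below is a minimal sketch, where the `Scenario` record and `model.apply_instruction` are illustrative stand-ins rather than the benchmark's published interfaces:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One DELEGATE-52-style task: a clean document plus an edit chain."""
    source_document: str
    instructions: list[str]
    reference_document: str  # ground truth with all intended edits applied

def run_scenario(model, scenario: Scenario) -> str:
    """Feed the edit chain to the model one instruction at a time."""
    doc = scenario.source_document
    for instruction in scenario.instructions:
        # The model must apply this single edit and leave every other
        # part of the document untouched.
        doc = model.apply_instruction(doc, instruction)
    # The final text is scored against scenario.reference_document
    # (see the corruption metric below).
    return doc
```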
What the experiment found
The authors evaluated 19 publicly available LLMs, ranging from open‑source models to the latest commercial offerings (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4). The key findings are:
| Model family | Average corruption rate* |
|---|---|
| Gemini 3.1 Pro | 24% |
| Claude 4.6 Opus | 26% |
| GPT 5.4 | 25% |
| Older open‑source models | 38%–62% |
*Corruption is measured as the proportion of tokens that differ from the ground‑truth reference after the full workflow, excluding intentional edits.
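In code, that metric reduces to a token‑level alignment between the model's final output and a reference that already contains every intended edit, so any leftover difference is corruption by construction. A rough sketch using Python's standard `difflib` (whitespace tokenization is an assumption; the paper's exact tokenizer isn't described here):

```python
import difflib

def corruption_rate(model_output: str, reference: str) -> float:
    """Fraction of reference tokens the model's final output gets wrong.

    The reference incorporates all intended edits, so matching tokens are
    preserved content and everything unmatched counts as corruption.
    """
    ref_tokens = reference.split()
    out_tokens = model_output.split()
    matcher = difflib.SequenceMatcher(a=ref_tokens, b=out_tokens,
                                      autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / max(len(ref_tokens), 1)
```

Because the reference already reflects the requested changes, the score isolates collateral damage rather than penalizing the edits themselves.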
Even the most advanced models introduced errors in roughly one‑quarter of the document content when the workflow stretched beyond a handful of interactions. Errors were typically sparse but severe: a single misplaced character in a code file could break compilation, while an altered numeric value in a scientific table could invalidate results.
Why tool use didn’t help
One hypothesis was that giving models access to external tools—such as a Python interpreter, a file diff utility, or a syntax checker—might reduce mistakes. The authors ran a second set of experiments where the same models could call these tools autonomously. The results showed no statistically significant improvement; in some cases, tool use even increased the error rate because the models generated malformed tool calls that were silently ignored.
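That silent‑failure mode suggests one practical mitigation, even if it wouldn't fix the underlying problem: have the harness reject malformed tool calls loudly, so the model learns that nothing ran. A hypothetical dispatcher sketch (the tool names, JSON schema, and error format are all assumptions, not the paper's setup):

```python
import json

# Illustrative registry; the paper's tools (interpreter, diff utility,
# syntax checker) would slot in here. Names and signatures are assumed.
TOOLS = {
    "run_diff": lambda args: f"diff of {args['old']} vs {args['new']}",
    "check_syntax": lambda args: f"checked {args['path']}",
}

def dispatch(raw_call: str) -> str:
    """Parse and run one tool call; return an explicit error instead of
    silently ignoring malformed input, so the model gets feedback."""
    try:
        call = json.loads(raw_call)
        tool = TOOLS[call["tool"]]
        return tool(call["args"])
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Feed the failure back into the transcript rather than dropping it.
        return f"TOOL_ERROR: malformed call ({exc!r}); nothing was executed."
```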
Factors that worsen degradation
The paper isolates three variables that amplify corruption:
- Document size – the larger the file, the more content the model must hold in context, and the more opportunities there are for drift.
- Interaction length – each additional edit instruction compounds the chance of a slip.
- Distractor files – when the workspace contains unrelated files, models sometimes copy or delete the wrong content.
These findings line up with anecdotal reports from developers who have observed “ghost edits” after long chat‑based refactoring sessions.
Implications for the industry
The study does not claim that LLMs are useless for delegation, but it does suggest that current systems are not ready for unsupervised, high‑stakes document work. Companies building AI‑assisted IDEs, legal‑tech platforms, or scientific writing assistants should consider:
- Adding explicit verification steps after each model‑generated edit.
- Restricting the scope of delegation to small, well‑defined patches.
- Building robust rollback mechanisms that can restore a known‑good version if corruption is detected (a combined verify‑and‑rollback loop is sketched below).
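Combining the last two points, a delegation harness might wrap every model edit in a verify‑then‑commit step. A minimal sketch, where `model.apply_instruction` is a placeholder for whatever agent interface is in use and the verifier is deliberately crude:

```python
import difflib

def small_patch_only(before: str, after: str,
                     max_changed_lines: int = 10) -> bool:
    """Illustrative verifier: reject edits that touch more lines than the
    delegation scope allows (counts removed and added lines)."""
    changed = sum(
        1
        for line in difflib.unified_diff(before.splitlines(),
                                         after.splitlines(), lineterm="")
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))
    )
    return changed <= max_changed_lines

def checked_edit(model, doc: str, instruction: str,
                 verify=small_patch_only) -> str:
    """Apply one delegated edit, verify it, and roll back on failure."""
    snapshot = doc  # known-good version to restore
    candidate = model.apply_instruction(doc, instruction)
    return candidate if verify(snapshot, candidate) else snapshot
```

A production verifier would be document‑type aware (compiling code, validating a data schema), but even a crude line‑count cap catches the "ghost edit" failure mode described above.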
In short, the trust model for delegated LLM work needs to be conditional, not absolute.
Next steps for research
The authors propose two avenues for improvement:
- Self‑audit capabilities – training models to flag edits that deviate from the original semantics, perhaps by comparing abstract syntax trees for code or structural representations for LaTeX (an AST‑based check is sketched after this list).
- Curriculum‑style fine‑tuning – exposing models to long‑chain edit sequences during training so they learn to maintain context over many steps.
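For the code case, the AST comparison the authors gesture at can be prototyped with Python's standard `ast` module. This sketch only detects that semantics changed at all, not whether the change was the one requested, which is the hard part of a real self‑audit:

```python
import ast

def semantics_changed(before_src: str, after_src: str) -> bool:
    """Flag edits that alter code semantics, not just surface text.

    Comparing normalized AST dumps ignores whitespace and comments, so a
    purely cosmetic edit passes while a changed constant or identifier
    trips the audit.
    """
    try:
        before = ast.dump(ast.parse(before_src))
        after = ast.dump(ast.parse(after_src))
    except SyntaxError:
        return True  # unparseable output is itself a red flag
    return before != after
```

Here `semantics_changed("x = 1  # old", "x = 1  # new")` is `False`, since comments never reach the AST, while `semantics_changed("x = 1", "x = 2")` is `True`.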
Both approaches require new datasets and evaluation pipelines, and DELEGATE-52 itself could serve as a benchmark for future iterations.
The full paper, including the benchmark code and dataset, is available on arXiv: 2604.15597.
