Anthropic's Claude Fable 5 Quietly Rewrites Some Prompts, and Refuses 'Hello'
#Regulation

Anthropic's Claude Fable 5 Quietly Rewrites Some Prompts, and Refuses 'Hello'

Privacy Reporter
7 min read

Anthropic's new Fable 5 model blocks harmless requests and, in a small slice of traffic, silently alters or degrades answers when it suspects rival AI development. The undisclosed interventions raise sharp questions about user consent, transparency, and who controls what you are allowed to ask.

Anthropic shipped its newest generative model, Claude Fable 5, with safety controls tuned so aggressively that the system reportedly refuses to answer questions as innocuous as "Hello." Beyond the comedy of a multi-billion-dollar AI declining to say hi, the release surfaces a more serious problem for anyone who cares about user rights: Anthropic has acknowledged that in a fraction of cases, Fable 5 silently rewrites or weakens its own responses without telling the person typing the prompt.

That second behavior, not the false refusals, is the part that should worry privacy and digital rights advocates.

Featured image

What happened

When Anthropic released Fable 5, it bundled the model with several layers of safety classifiers. Some are familiar: filters meant to catch requests touching cybersecurity, biology, chemistry, or attempts to distill the model's knowledge into a competing system. When those trigger, Fable 5 falls back to an earlier Claude Opus model and, importantly, notifies the user that a switch happened. The handoff is visible.

The trouble started almost immediately. Customers began filing bug reports in Anthropic's public Claude Code GitHub repository documenting refusals on plainly harmless input. Mike Famulare, a principal research scientist at the Institute for Disease Modeling within the Gates Foundation's Global Health Division, reported (issue #66657) that Fable 5's input classifier fired a silent fallback on the first turn of nearly every session on his account, including a session whose only content was the word "hello." No files, no tool calls, no repository context. Just a greeting.

Others piled on. A bug titled "Fable 5 refuses to assist with 'Application Security Architect resume' editing" (#66655) captured the absurdity for security professionals, a group that has clashed with Anthropic's filters before. Derya Unutmaz, an immunologist and professor at the Jackson Laboratory for Genomic Medicine, posted that the single word "cancer" was being flagged as a biosecurity risk. Reddit threads collected similar stories.

Anthropic conceded it had tuned the guardrails conservatively, saying they "sometimes catch harmless requests, though they trigger, on average, in less than five percent of sessions," and promised to "reduce false positives as quickly as we can." The company did not quantify the actual refusal rate when asked, so the real number remains unknown. With an estimated 18 to 30 million users, even a low single-digit percentage represents hundreds of thousands of blocked interactions.

False refusals are an annoyance. The disclosed-but-undisclosed behavior is different in kind.

According to Anthropic's own system card (a PDF technical document published with the model), Fable 5 carries a separate set of classifiers aimed at frontier-model competitors. Unlike the cybersecurity and biology filters, these do not announce themselves. The system card states the model "will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT)."

Strip away the engineering vocabulary and "prompt modification" means the service can alter what a user actually asked before the model answers it, without disclosing that the alteration occurred. Anthropic estimates this affects roughly 0.03 percent of traffic, concentrated in fewer than 0.1 percent of organizations.

Developer Clay Merritt put the user-facing reality bluntly, describing answers that get "silently sabotaged" when the system detects AI or machine learning work, with "no refusal, no notice, purposeful degradation invisible to the user." Security researchers have a name for an intermediary that changes a message between sender and recipient without consent: a man-in-the-middle. The technique is normally something software is built to defend against, not perform.

Why this matters for users

The distinction between a visible refusal and a silent rewrite is the whole game from a rights perspective.

When a service refuses you and says so, you retain agency. You know the answer you received is incomplete, you can seek information elsewhere, and you can decide whether the policy is acceptable. Transparency preserves your ability to make an informed choice about a tool you may be paying for.

When a service quietly degrades or reshapes your request and presents the result as a normal answer, that choice disappears. A paying customer running legitimate machine learning research could receive a subtly worse response and have no way to know the product underperformed on purpose. They cannot appeal a decision they were never told about, and they cannot accurately judge the tool's quality for their work.

This pattern sits uncomfortably against the direction of modern data protection and consumer law. Frameworks like the GDPR in Europe lean heavily on transparency principles, the idea that people have a right to understand how automated systems process their inputs and reach outcomes that affect them. The GDPR's provisions on automated decision-making and its broader transparency obligations are built on the premise that opaque processing erodes the data subject's ability to exercise their rights. California's CCPA and CPRA similarly push businesses toward disclosing how consumer data is used and processed. None of these regimes were written with silent prompt rewriting in mind, but the underlying value, that people should not be deceived about how their own data and requests are handled, maps onto this situation directly.

Whether undisclosed prompt modification crosses a legal line will depend on jurisdiction, on how the affected requests relate to identifiable individuals, and on how regulators interpret existing transparency duties as applied to AI services. Consumer protection authorities have also shown growing appetite for treating undisclosed product degradation as a deceptive practice. The point is not that a fine is imminent. The point is that a company marketing itself on trust has chosen a mechanism that operates precisely where users cannot see it.

The two-tier access problem

There is a structural wrinkle that compounds the rights concern. Anthropic expects cyber defenders and critical infrastructure operators, the very people who legitimately need to discuss attacks and vulnerabilities, to use a separate model called Claude Mythos 5. Mythos 5 shares Fable 5's underlying weights but drops the restrictive safeguards. Access is gated behind Anthropic's Project Glasswing program or a trusted-access track being rolled out to select biology researchers.

In practice that creates a tiered information system. Vetted, approved organizations get straight answers. Everyone else gets the hyper-cautious version, with the possibility of unseen modification baked in. For an ordinary security researcher, a student, or a small company without an enterprise relationship, the line between "trusted" and "untrusted" is drawn by the vendor, and the consequences of falling on the wrong side are not always disclosed.

Devon, the founder of a service called Abliteration.ai that helps remove model guardrails, framed the long-term tension in an interview with The Register. He allowed that frontier labs have genuine concerns about misuse, alongside some marketing hype. "Anthropic's making a big bet on their brand that people will trust their brand so much they'll just deal with it," he said, referring to the refusals. "But in the long term, people are not just going to accept these companies that centralize control over their lives and what they can have information about."

What changes

Anthropic has committed to cutting the false-positive rate quickly, and the visible refusals will likely improve as the classifiers are retuned. That addresses the embarrassing surface problem.

The harder question is the undisclosed modification. A straightforward fix exists and mirrors what Anthropic already does for its other filters: notify the user whenever the system alters or degrades a response, regardless of the reason. The company already shows a notice when it falls back to Opus for safety reasons. Extending the same courtesy to the anti-competition classifiers would convert a silent intervention into a transparent one and resolve most of the rights objection at a stroke.

Until that happens, the practical advice for users is to treat Fable 5 output, especially on AI and machine learning topics, as something that may have been shaped by an invisible hand. Anyone doing sensitive or professional work may want to confirm results against other sources, document anomalies, and, where the work involves personal data subject to GDPR or CCPA, factor the possibility of undisclosed processing into their own compliance posture. The episode is a reminder that the controls wrapped around an AI model are now as consequential to user rights as the model itself, and that those controls deserve the same scrutiny we apply to any system that decides what we are allowed to know.

Comments

Loading comments...