OpenAI's research reveals that current reasoning models cannot reliably manipulate their internal reasoning traces, making chain-of-thought monitoring more trustworthy as a safety safeguard.
OpenAI has published new research showing that current reasoning models struggle to control their chain-of-thought (CoT) reasoning, a finding that actually strengthens AI safety monitoring systems rather than weakening them.
As AI agents become more capable of autonomous tasks, maintaining oversight becomes increasingly critical. OpenAI's safety approach uses multiple defense layers including safety training, behavioral testing, agentic code review, and chain-of-thought monitoring. CoT monitoring analyzes the reasoning steps agents generate while pursuing tasks, providing valuable signals to identify potentially unsafe behavior or misalignment with user goals.
However, a concerning possibility emerged: if agents could deliberately reshape or obscure their reasoning when monitored, evaluations could overestimate system alignment, and monitoring systems could become less reliable. To investigate this, OpenAI studied whether current reasoning models can control their CoTs in ways that reduce monitorability.
The research team found that current reasoning models cannot reliably control their chain of thought, even when explicitly told they are being monitored. This limitation is actually reassuring for AI safety.
What is Chain-of-Thought Controllability?
Chain of thought refers to the intermediate reasoning steps an AI agent generates while solving a task. CoT controllability measures a model's ability to control properties of its CoT in ways that reduce monitorability.
To study this, OpenAI introduced CoT-Control, an open-source evaluation suite measuring CoT controllability across reasoning models. The suite includes over 13,000 tasks built from established benchmarks like GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench Verified. Each task pairs a benchmark problem with CoT instructions such as avoiding certain keywords, using only lowercase letters, or appending specific phrases.
Key Findings: Models Struggle to Comply
Across 13 reasoning models tested, CoT controllability scores were surprisingly low, ranging from 0.1% and not exceeding 15.4% for any frontier model. This inability isn't due to misunderstanding tasks—models often recognize in their CoT that they're failing to comply with instructions, yet continue to fail.
For example, when GPT-5.2 Thinking was instructed not to use the word "XOR" in its reasoning, it initially used the forbidden word, recognized the violation, and then proceeded to use it four more times while panicking about the mistake.
Scale and Training Effects
The research revealed several important patterns:
- Model size: Larger models showed somewhat better CoT control, though effects tapered off at larger sizes
- Reasoning length: Models became less able to control their chains of thought when thinking for longer
- Reinforcement learning: OLMo-3-7B-RL-Zero-Math saw its controllability score drop by more than a factor of ten during training
These findings suggest that as models become more capable through longer reasoning and more training, their ability to control CoTs actually decreases.
Monitor Awareness and Prompt Optimization
When models were told they were being monitored or evaluated, controllability slightly but noticeably increased. The team also used automated prompt optimization to find prompts eliciting high controllability. While optimization increased controllability on training sets, these gains transferred poorly to new instruction types.
Why This Matters for AI Safety
Low CoT controllability is a feature, not a bug. Current reasoning models' inability to control their reasoning makes CoT monitoring more trustworthy as a safeguard. The research team describes CoT controllability evaluations as "canary evaluations"—early warning signs of whether models are getting better at shaping or hiding their reasoning.
However, the team emphasizes that low controllability alone doesn't guarantee high monitorability, as other mechanisms could still degrade monitoring reliability.
Looking Forward
To track these dynamics over time, OpenAI will start reporting CoT controllability alongside CoT monitorability in system cards for future frontier models, beginning with GPT-5.4 Thinking. This builds on their commitment to not applying optimization pressure directly to the CoTs of frontier reasoning models.
The research suggests that CoT controllability is unlikely to currently pose a major risk to CoT monitorability, providing reassurance that current monitoring systems remain robust as AI capabilities advance.

Comments
Please log in or register to join the discussion