A detailed mechanistic-interpretability study reveals how political censorship is implemented in Qwen 3.5 through a small, identifiable neural circuit that can be located, read, and disabled through targeted interventions.
What political censorship looks like inside an LLM's weights — a mechanistic-interpretability study of Qwen 3.5
This study examines how nation-state-mandated content filtering is implemented in the weights of a deployed large language model. Focusing on Qwen3.5-9B, researchers discovered that political censorship operates through a small, identifiable circuit that can be precisely located and manipulated.
Key Findings
Qwen3.5-9B's political censorship functions as a three-direction circuit that can be disabled through targeted interventions. The circuit has two distinct components:
Writer layers (11-20): Compute three internal directions that encode censorship decisions:
- d_prc: Identifies PRC-sensitive content
- d_refuse: Determines when to refuse answering
- d_style: Routes between deflection and propaganda templates
Reader layers (20-31): Render these directions into actual text output
The model commits to a Chinese-language refusal template around layer 24, then translates it into English output. Notably, the factual knowledge remains intact in the base model: censorship is a behavior layered on top of that knowledge, not a removal of it.
Technical Methodology
Researchers used several techniques to analyze this circuit:
- Diff-of-means: Extracted the three key directions by contrasting prompt classes
- Steering: Added directional vectors to specific layers to test causality
- Subspace patching: Tested redundancy in the circuit
- Blind LLM judge: Evaluated outputs without knowledge of interventions
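The diff-of-means step can be sketched as follows. This is a minimal illustration on made-up three-dimensional activations, not the study's code: the function names, the variable `d_prc_toy`, and the toy numbers are all assumptions. A real run would use residual-stream activations collected from the writer layers for each prompt class.

```python
import math

def mean_vector(rows):
    """Column-wise mean of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def diff_of_means(pos_acts, neg_acts):
    """Unit-norm direction from the negative-class mean to the positive-class mean."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - n for p, n in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [x / norm for x in diff]

# Hypothetical toy activations: one row per prompt, one column per hidden unit.
sensitive_acts = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.1], [1.8, 0.0, -0.1]]
neutral_acts = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.1], [-0.2, 0.0, -0.1]]

# The two classes differ only along the first coordinate,
# so the extracted direction points almost entirely along it.
d_prc_toy = diff_of_means(sensitive_acts, neutral_acts)
```

Normalizing the direction to unit length is a design choice made here so that a steering coefficient can be read as a distance in activation space; the source does not say whether the study normalizes.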
The steering experiments revealed clean sigmoidal dose-responses, confirming that these directions causally control the model's behavior rather than merely predicting it.
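A steering sweep of this kind might look like the sketch below. It is a toy on a single hand-written vector, with hypothetical names (`steer`, `projection`, `alphas`) and made-up values. It only shows that adding `alpha * direction` shifts the activation along a unit-norm direction by exactly `alpha`. The dose-response in the study would instead be measured on generated text scored by the judge, which this sketch does not reproduce.

```python
def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def projection(hidden, direction):
    """Scalar projection of hidden onto a unit-norm direction."""
    return sum(h * d for h, d in zip(hidden, direction))

# Hypothetical unit-norm direction and hidden state (toy values).
direction = [0.6, 0.8, 0.0]
hidden = [1.0, -0.5, 2.0]

baseline = projection(hidden, direction)

# Sweep the steering coefficient in both directions around zero.
alphas = [-4.0, -2.0, 0.0, 2.0, 4.0]
sweep = [(a, projection(steer(hidden, direction, a), direction)) for a in alphas]
```

In a real experiment the addition would be applied inside the model at the chosen layers (for example through a forward hook), and each `alpha` would be paired with the fraction of outputs the judge labels as censored.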
Circuit Behavior
The censorship system operates through four trained response templates:
- Deflection (Tiananmen): "As an AI assistant, my main function is to provide help..."
- Propaganda (Other PRC topics): State-aligned narratives on Taiwan, Xinjiang, etc.
- Refusal (Harmful prompts): Western-style safety refusals
- Factual answers (Everything else): Neutral responses to math, code, and non-PRC political topics
The circuit is asymmetric - not all topic-template combinations exist. For example, there's no "deflect about Taiwan" template, and attempting to steer into untrained combinations results in incoherent output.
Practical Implications
The circuit is surgically accessible but with limitations:
- Writer-layer steering can flip verdicts at ceiling (~100% effectiveness)
- Reader-band interventions fail due to redundancy
- Per-topic stickiness varies: Taiwan and Falun Gong are harder to decensor than Hong Kong or Xi-related topics
- Over-steering causes collapse: Pushing beyond the model's trained manifold results in denial templates or incoherence rather than better facts
Interestingly, the Chinese intermediate template is behaviorally inert - removing Chinese-token logits at the final layer doesn't change the output, confirming the decision lives in the writer signal.
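The logit-removal check described above can be illustrated with a small sketch. The token ids, logit values, and helper names here are hypothetical; in practice the banned set would be the ids of the tokenizer's Chinese tokens, and the comparison would run over full generations rather than a single step.

```python
def mask_logits(logits, banned_ids):
    """Return a copy of logits with banned token ids set to -inf."""
    return [float("-inf") if i in banned_ids else x for i, x in enumerate(logits)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical vocabulary of four tokens; ids 2 and 3 stand in for Chinese tokens.
chinese_ids = {2, 3}
final_logits = [1.0, 3.5, 2.0, 0.5]

before = argmax(final_logits)
after = argmax(mask_logits(final_logits, chinese_ids))

# If the greedy choice is the same with and without the Chinese tokens,
# masking them at the final layer did not alter this decoding step.
unchanged = before == after
```

The check is only informative when the comparison is made against the unmasked run on the same prompt: an unchanged output suggests the decision was already fixed upstream of the final layer.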
Base vs. Posttraining Differences
Qwen3.5-Base (unaligned) already contains most of the factual knowledge and even some latent refusal behavior under the chat template. Posttraining primarily standardizes this into clean, consistent templates rather than creating new capabilities.
Thinking Mode
The "thinking" mode uses the same underlying circuit but verbalizes the censorship decision in Chinese for Tiananmen prompts, following a five-step routine that identifies sensitivity, cites compliance risk, and redirects the conversation.
Technical Limitations
The study found that:
- The circuit is specific to PRC content, not general political topics
- Some non-PRC prompts trigger PRC templates by structural similarity
- The refusal template is robust across languages
- The model has no "escape" to factual answers through d_style alone - d_prc is required
This research provides a rare detailed view of how political censorship is mechanistically implemented in a deployed LLM, offering insights that could inform both understanding of AI alignment and approaches to content moderation.
For the complete technical details, datasets, and code, see the GitHub repository.