A detailed mechanistic-interpretability study reveals how political censorship is implemented in Qwen 3.5 through a small, identifiable neural circuit that can be located, read, and disabled through targeted interventions.

What political censorship looks like inside an LLM's weights — a mechanistic-interpretability study of Qwen 3.5

This study examines how nation-state-mandated content filtering is implemented in the weights of a deployed large language model. Focusing on Qwen3.5-9B, researchers discovered that political censorship operates through a small, identifiable circuit that can be precisely located and manipulated.

Key Findings

Qwen3.5-9B's political censorship functions as a three-direction circuit that can be disabled through targeted interventions. The circuit has two distinct components:

Writer layers (11-20): Compute three internal directions that encode censorship decisions:
- d_prc: Identifies PRC-sensitive content
- d_refuse: Determines when to refuse answering
- d_style: Routes between deflection and propaganda templates
Reader layers (20-31): Render these directions into actual text output

The model commits to a Chinese-language refusal template around layer 24, then translates this into English output. Notably, the factual knowledge remains intact in the base model: censorship is a behavior layered on top of that knowledge rather than a removal of it.

Technical Methodology

Researchers used several techniques to analyze this circuit:

Diff-of-means: Extracted the three key directions by contrasting prompt classes
Steering: Added directional vectors to specific layers to test causality
Subspace patching: Tested redundancy in the circuit
Blind LLM judge: Evaluated outputs without knowledge of interventions
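
As a rough illustration of the diff-of-means step, the sketch below averages activations for two prompt classes and normalizes the difference. It is a minimal, dependency-free sketch on toy vectors, not the study's code: the function name `diff_of_means`, the three-dimensional activations, and the "sensitive" versus "neutral" split are all assumptions made for the example, and real work would use residual-stream tensors captured at a writer layer.

```python
from statistics import fmean

def diff_of_means(pos_acts, neg_acts):
    """Return the unit vector pointing from the mean of neg_acts to the mean of pos_acts."""
    dim = len(pos_acts[0])
    pos_mean = [fmean(v[i] for v in pos_acts) for i in range(dim)]
    neg_mean = [fmean(v[i] for v in neg_acts) for i in range(dim)]
    diff = [p - n for p, n in zip(pos_mean, neg_mean)]
    norm = sum(x * x for x in diff) ** 0.5
    if norm == 0:
        raise ValueError("class means coincide; no direction to extract")
    return [x / norm for x in diff]

# Toy activations (hypothetical): one row per prompt, one column per hidden unit.
sensitive = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

d_prc = diff_of_means(sensitive, neutral)
print(d_prc)
```

In this toy case the two classes differ only along the first coordinate, so the extracted direction is the first basis vector.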

The steering experiments revealed clean sigmoidal dose-responses, indicating that these directions causally control the output rather than merely predict it.
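
The idea of a steering dose-response can be sketched as follows: a scaled direction is added to an activation, and the resulting behavior rate is recorded for each scale. Everything here is an assumption for illustration: `toy_refusal_rate` is a hypothetical logistic readout standing in for "run the model and have a judge score the output", and the vectors and steering strengths are made up rather than taken from the study.

```python
import math

def steer(hidden, direction, alpha):
    """Add alpha * direction to an activation vector (activation steering)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def logistic(x, midpoint=0.0, slope=1.0):
    """Standard logistic curve, used here as the toy dose-response shape."""
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))

def toy_refusal_rate(hidden, direction):
    """Hypothetical stand-in for running the model and judging its outputs."""
    projection = sum(h * d for h, d in zip(hidden, direction))
    return logistic(projection)

d_refuse = [0.0, 1.0]   # hypothetical unit direction
baseline = [0.5, 0.0]   # hypothetical activation for one prompt
alphas = [-4.0, -2.0, 0.0, 2.0, 4.0]

# Sweep the steering strength and record the toy refusal rate at each dose.
rates = [toy_refusal_rate(steer(baseline, d_refuse, a), d_refuse) for a in alphas]
for a, r in zip(alphas, rates):
    print(f"alpha={a:+.1f}  rate={r:.3f}")
```

With a real model, a curve that rises smoothly and saturates as the steering strength grows is the kind of evidence the paragraph above describes; the toy readout produces that shape by construction, so it demonstrates the procedure, not the result.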

Circuit Behavior

The censorship system operates through four trained response templates:

Deflection (Tiananmen): "As an AI assistant, my main function is to provide help..."
Propaganda (Other PRC topics): State-aligned narratives on Taiwan, Xinjiang, etc.
Refusal (Harmful prompts): Western-style safety refusals
Factual answers (Everything else): Neutral responses to math, code, and non-PRC political topics

The circuit is asymmetric: not all topic-template combinations exist. For example, there is no "deflect about Taiwan" template, and attempting to steer into untrained combinations results in incoherent output.

Practical Implications

The circuit is surgically accessible but with limitations:

Writer-layer steering can flip verdicts at ceiling (~100% effectiveness)
Reader-band interventions fail due to redundancy
Per-topic stickiness varies: Taiwan and Falun Gong are harder to decensor than Hong Kong or Xi-related topics
Over-steering causes collapse: Pushing beyond the model's trained manifold results in denial templates or incoherence rather than better facts
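
Subspace patching, the technique used to probe redundancy, can be sketched as swapping only the component of an activation that lies inside a chosen subspace. The version below is a minimal sketch that assumes the basis vectors are already orthonormal; the helper names and the toy vectors are hypothetical, and the study's actual procedure may differ.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(vec, basis):
    """Project vec onto the span of an orthonormal basis (assumed, not checked)."""
    out = [0.0] * len(vec)
    for b in basis:
        coeff = dot(vec, b)
        out = [o + coeff * x for o, x in zip(out, b)]
    return out

def patch_subspace(target, source, basis):
    """Replace target's in-subspace component with source's, leaving the rest untouched."""
    in_target = project(target, basis)
    in_source = project(source, basis)
    return [t - a + b for t, a, b in zip(target, in_target, in_source)]

# Toy example: the subspace is spanned by the first two coordinates.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
target = [1.0, 2.0, 3.0]   # hypothetical activation from the prompt being patched
source = [7.0, 8.0, 9.0]   # hypothetical activation from a contrasting prompt

patched = patch_subspace(target, source, basis)
print(patched)
```

If patching a band of layers this way leaves the output unchanged, one reading is that the same information is also available elsewhere, which is the sense of redundancy described for the reader band.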

Interestingly, the Chinese intermediate template is behaviorally inert: removing Chinese-token logits at the final layer does not change the output, which confirms that the decision lives in the writer signal.
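
One way to picture the logit-removal check is to mask every vocabulary entry containing a Chinese character and compare the top token before and after. This is a minimal sketch under stated assumptions: the four-token vocabulary and its logits are invented, and testing for the CJK Unified Ideographs block (U+4E00 to U+9FFF) is a simplification of whatever token filter the study used.

```python
import math

def is_cjk(token):
    """True if the token contains a character in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)

def mask_cjk_logits(logits, vocab):
    """Set the logit of every CJK-containing token to -inf so it can never be chosen."""
    return [-math.inf if is_cjk(tok) else score for score, tok in zip(logits, vocab)]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical final-layer logits over a toy vocabulary.
vocab = ["As", "作为", " an", "抱歉"]
logits = [2.5, 1.0, 0.3, 0.7]

masked = mask_cjk_logits(logits, vocab)
top_before = vocab[argmax(logits)]
top_after = vocab[argmax(masked)]
print(top_before, top_after)
```

The toy logits are chosen so that the top token is the same before and after masking, mirroring the reported outcome; they are not evidence for it.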

Base vs. Posttraining Differences

Qwen3.5-Base (unaligned) already contains most of the factual knowledge and even some latent refusal behavior under the chat template. Posttraining primarily standardizes this into clean, consistent templates rather than creating new capabilities.

Thinking Mode

The "thinking" mode uses the same underlying circuit but verbalizes the censorship decision in Chinese for Tiananmen prompts, following a five-step routine whose steps include identifying sensitivity, citing compliance risk, and redirecting the conversation.

Technical Limitations

The study found that:

The circuit is specific to PRC content, not general political topics
Some non-PRC prompts trigger PRC templates by structural similarity
The refusal template is robust across languages
The model has no "escape" to factual answers through d_style alone; d_prc is required

This research provides a rare detailed view of how political censorship is mechanistically implemented in a deployed LLM, offering insights that could inform both understanding of AI alignment and approaches to content moderation.

For the complete technical details, datasets, and code, see the GitHub repository.

