Vision-Language Models Exhibit Human-Like Conflict Adaptation in Stroop Task, Revealed by Sparse Autoencoders

In a striking parallel to human cognition, most vision-language models (VLMs) demonstrate conflict adaptation, a hallmark of cognitive control in which performance on high-conflict trials improves when the preceding trial was also high-conflict. This phenomenon, long studied in humans via the Stroop task, suggests that VLMs may recruit mechanisms analogous to scarce cognitive resources, offering new insight into their internal representations.


The Stroop Task Meets Large-Scale AI

The research, detailed in arXiv:2510.24804v2 [cs.CV] and accepted to the Workshop on Interpreting Cognition in Deep Learning Models at NeurIPS 2025, tested 13 VLMs using a sequential Stroop task. In this setup, models classify whether a word's meaning (e.g., "red") matches its font color (congruent) or conflicts with it (incongruent). In humans, responses to incongruent trials are faster when the preceding trial was also incongruent rather than congruent; this sequential improvement is conflict adaptation.
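
The article does not reproduce the authors' stimulus pipeline, but the basic construction is easy to sketch: render a color word in a matching or mismatching ink, then tag each trial with the congruency of the trial before it. The Python sketch below is one minimal way to do this with Pillow; the names, canvas size, and color set are illustrative assumptions, not the paper's code.

```python
# Sketch of a sequential Stroop stimulus generator (illustrative only).
import random
from collections import Counter
from PIL import Image, ImageDraw

WORDS = ["red", "green", "blue"]
COLORS = {"red": (255, 0, 0), "green": (0, 170, 0), "blue": (0, 0, 255)}

def render_stimulus(word: str, ink: str, size=(224, 224)) -> Image.Image:
    """Render the color word `word` in font color `ink` on a white canvas."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    # Default bitmap font; a real pipeline would use a large TrueType font.
    draw.text((size[0] // 3, size[1] // 2), word, fill=COLORS[ink])
    return img

def sample_trial() -> dict:
    word = random.choice(WORDS)
    ink = random.choice(WORDS)
    return {
        "image": render_stimulus(word, ink),
        "word": word,
        "ink": ink,
        "congruent": word == ink,  # the label the VLM must report
    }

def build_sequence(n_trials: int) -> list[dict]:
    """Tag each trial with the previous trial's congruency, the pairing
    conflict adaptation is computed over."""
    trials = [sample_trial() for _ in range(n_trials)]
    for prev, cur in zip(trials, trials[1:]):
        cur["prev_congruent"] = prev["congruent"]
    return trials[1:]  # drop the first trial, which has no predecessor

if __name__ == "__main__":
    seq = build_sequence(200)
    # Four sequential cells: (prev congruency, current congruency) = cC, cI, iC, iI
    print(Counter((t["prev_congruent"], t["congruent"]) for t in seq))
```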

Remarkably, 12 of 13 VLMs replicated this behavior, with the outlier likely due to ceiling effects in accuracy. This consistency across diverse architectures underscores an emergent property not explicitly trained for, hinting at shared representational strategies in multimodal AI.
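
Conflict adaptation itself is a simple sequential statistic: the congruency effect (accuracy on congruent minus incongruent trials) should shrink when the previous trial was incongruent. Below is a minimal sketch of that computation over trial-level responses; the field names are assumptions, not the paper's schema.

```python
# Quantify conflict adaptation from trial-level responses (illustrative).
from statistics import mean

def congruency_sequence_effect(trials: list[dict]) -> float:
    """Each trial dict needs 'congruent', 'prev_congruent', 'correct' (bool).
    Returns the reduction in the congruency effect after incongruent trials;
    a positive value indicates conflict adaptation."""
    def acc(prev_c: bool, cur_c: bool) -> float:
        cell = [float(t["correct"]) for t in trials
                if t["prev_congruent"] == prev_c and t["congruent"] == cur_c]
        return mean(cell) if cell else float("nan")

    effect_after_congruent = acc(True, True) - acc(True, False)      # cC - cI
    effect_after_incongruent = acc(False, True) - acc(False, False)  # iC - iI
    return effect_after_congruent - effect_after_incongruent
```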

Unpacking Representations with Sparse Autoencoders

To probe deeper, the authors applied sparse autoencoders (SAEs) to InternVL 3.5 4B, a 4-billion parameter VLM. SAEs decompose activations into interpretable 'supernodes'—sparse, monosemantic features that capture task-relevant concepts.
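
The article does not specify the SAE architecture, but a common recipe in interpretability work is a linear encoder with a ReLU, a linear decoder, and an L1 penalty that pushes feature activations toward sparsity. The sketch below follows that generic recipe and is not the authors' implementation.

```python
# Generic sparse autoencoder for decomposing residual-stream activations
# (a common interpretability recipe, not necessarily the paper's exact setup).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Non-negative, sparse feature activations
        return torch.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        return self.decode(f), f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3) -> torch.Tensor:
    """Reconstruction error plus an L1 penalty that encourages sparse,
    (ideally) monosemantic features."""
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

# Usage sketch: collect activations `acts` of shape [n_tokens, D] from a
# chosen layer (D = residual stream width), then train the SAE on them.
# sae = SparseAutoencoder(d_model=D, d_features=8 * D)
# x_hat, f = sae(acts)
# loss = sae_loss(acts, x_hat, f)
```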

Key discoveries include:

  • Partially overlapping supernodes for text and color in both early and late layers, reflecting how VLMs disentangle yet integrate multimodal inputs.
  • An asymmetry in supernode sizes that mirrors human automaticity: text supernodes are larger (reading is fast and automatic) while color supernodes are smaller (color naming is slower and more effortful).
  • A conflict-modulated supernode in layers 24-25. Ablating this supernode sharply increased Stroop errors on incongruent trials while leaving congruent trials largely unaffected, causally linking it to conflict processing (see the ablation sketch below).

Supernode ablation results:

  • Congruent trials: minimal impact (~1-2% error increase)
  • Incongruent trials: substantial error rise (up to 15%)
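
Mechanically, the ablation amounts to encoding the layer's activations with the SAE, zeroing the features that make up the supernode, and decoding back. The sketch below assumes the generic SparseAutoencoder from the earlier snippet and hypothetical feature indices; it is illustrative, not the paper's code.

```python
# Ablate a 'supernode' by zeroing its SAE features and using the
# reconstruction in place of the original activation (illustrative).
import torch

@torch.no_grad()
def ablate_supernode(sae, acts: torch.Tensor, feature_idxs: list[int]) -> torch.Tensor:
    """acts: [n_tokens, d_model] activations from the hooked layer.
    feature_idxs: indices of the SAE features grouped into the supernode."""
    f = sae.encode(acts)
    f[:, feature_idxs] = 0.0  # knock out the conflict-modulated features
    return sae.decode(f)

# Comparing Stroop accuracy with and without the ablation, separately for
# congruent and incongruent trials, is what licenses the causal claim: if only
# incongruent accuracy drops, the supernode is specifically tied to conflict.
```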

These results suggest VLMs encode cognitive control via sparse, modular representations, challenging views of transformers as black boxes.

Implications for AI Interpretability and Cognitive Science

For developers and AI researchers, this work showcases mechanistic interpretability as a tool for reverse-engineering emergent behaviors. By identifying causally relevant supernodes, SAEs enable targeted interventions: ablations that isolate cognitive phenomena without disrupting the model globally.
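
In practice, such interventions are typically wired in by patching the ablated reconstruction into the model's forward pass at the target layer. Below is a generic PyTorch forward-hook sketch; the layer handle, `sae` object, and index list are placeholders, not InternVL 3.5's actual module names.

```python
# Patch ablated activations into a transformer layer's output via a forward
# hook (generic PyTorch mechanism; all specific names are placeholders).
import torch

def make_ablation_hook(sae, feature_idxs):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        f = sae.encode(hidden)
        f[..., feature_idxs] = 0.0
        patched = sae.decode(f)
        if isinstance(output, tuple):
            return (patched,) + output[1:]
        return patched
    return hook

# Hypothetical usage: `layer` is the transformer block hosting the
# conflict-modulated supernode (layers 24-25 in the paper's analysis).
# handle = layer.register_forward_hook(make_ablation_hook(sae, supernode_idxs))
# ... run the Stroop trials ...
# handle.remove()
```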

"Partially overlapping supernodes emerge for text and color... their relative sizes mirror the automaticity asymmetry between reading and color naming in humans." — Abstract from arXiv:2510.24804

In cognitive science, these findings bridge AI and human minds: VLMs not only mimic but mechanistically approximate human conflict adaptation. As VLMs scale, understanding such 'cognitive' features will be crucial for safety, alignment, and deploying reliable multimodal systems.

The study, led by Xiaoyang Hu, invites further exploration into how training data and architectures foster these human-like traits, potentially reshaping how we build and audit next-generation AI.