Image Diffusion Models Unlock Zero-Shot Video Object Tracking with DRIFT Framework

Image diffusion models have long dominated the generative AI landscape, excelling at creating photorealistic images from noise. Yet, a new arXiv preprint reveals these models possess an unexpected superpower: emergent temporal propagation that powers zero-shot object tracking in videos. Titled "Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos," the paper by Youngseo Kim, Dohyun Kim, Geohee Han, and Paul Hongsuck Seo, submitted on November 25, 2025 (arXiv:2511.19936), uncovers how self-attention maps in these models act as semantic label propagation kernels.

From Static Semantics to Dynamic Tracking

At the core of this discovery lies a reinterpretation of diffusion models' self-attention mechanisms. Although designed for image synthesis, these attention maps implicitly capture rich semantic structure, enabling tasks like recognition and localization without fine-tuning. The researchers extend this to video by treating self-attention as a temporal propagation kernel: attention weights between a labeled reference frame and a query frame define pixel-level correspondences that carry segmentation labels forward through time.
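To make that propagation idea concrete, here is a minimal sketch, assuming per-pixel features have already been extracted from the diffusion model's self-attention layers (the extraction itself is model-specific and omitted): a softmax-normalized cross-frame affinity plays the role of the kernel that carries a reference mask onto the current frame. The `propagate_labels` helper and its temperature value are illustrative choices, not the paper's exact implementation.

```python
# Minimal sketch of attention-style label propagation between two frames.
# `feat_ref` / `feat_cur` stand in for per-pixel features taken from a diffusion
# model's self-attention layers; how they are extracted is not shown here.

import torch
import torch.nn.functional as F

def propagate_labels(feat_ref, feat_cur, labels_ref, temperature=0.07):
    """Carry a reference-frame label map onto the current frame.

    feat_ref:   (C, H, W) features of the reference (labeled) frame
    feat_cur:   (C, H, W) features of the current (query) frame
    labels_ref: (H, W) integer (long) mask for the reference frame
    Returns an (H, W) integer mask predicted for the current frame.
    """
    C, H, W = feat_ref.shape
    ref = F.normalize(feat_ref.reshape(C, -1), dim=0)   # (C, HW)
    cur = F.normalize(feat_cur.reshape(C, -1), dim=0)   # (C, HW)

    # Affinity between every current-frame pixel and every reference pixel,
    # softmax-normalized over the reference axis -- this plays the role of
    # the temporal propagation kernel described in the paper.
    affinity = torch.softmax(cur.T @ ref / temperature, dim=-1)  # (HW, HW)

    # One-hot encode the reference labels and aggregate them through the kernel.
    num_classes = int(labels_ref.max()) + 1
    onehot = F.one_hot(labels_ref.reshape(-1), num_classes).float()  # (HW, K)
    probs = affinity @ onehot                                        # (HW, K)
    return probs.argmax(dim=-1).reshape(H, W)
```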

On its own, this mechanism already enables rudimentary zero-shot tracking by segmentation. To push performance further, the team introduces three test-time optimization strategies:

  • DDIM Inversion: Deterministically maps each input frame back into the diffusion latent space so that extracted features reflect the actual frame content (sketched after this list).
  • Textual Inversion: Optimizes text embeddings at test time for better semantic alignment with the input content.
  • Adaptive Head Weighting: Dynamically re-weights attention heads toward those that propagate labels consistently.

These adaptations make diffusion features robust enough for label propagation, even in challenging scenarios such as occlusion and motion blur.
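As a rough illustration of the DDIM inversion step, the snippet below sketches the standard deterministic update that maps a clean latent back toward noise, so that attention features are computed for the real frame rather than a generated sample. The `eps_model` noise predictor, `alphas_cumprod` schedule, and increasing `timesteps` sequence are assumed inputs; textual inversion and head re-weighting would be layered on top and are not shown.

```python
# Sketch of deterministic DDIM inversion: map a clean latent x0 back to a noisy
# latent so that diffusion features are computed on the *actual* frame content.
# `eps_model` and `alphas_cumprod` are placeholders for the pretrained noise
# predictor and its cumulative noise schedule.

import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, timesteps):
    """Run the DDIM update in reverse (t -> t+1) with no added noise."""
    x = x0
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(x, t_cur)                                  # predicted noise at t_cur
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()    # implied clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps    # step toward higher noise
    return x   # inverted latent aligned with the input frame
```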


DRIFT: A Complete Tracking Pipeline

Building on these insights, the authors present DRIFT (Diffusion-based Robust Inference for Tracking), a novel framework that integrates a pretrained image diffusion model with Segment Anything Model (SAM)-guided mask refinement. DRIFT operates entirely zero-shot—no training on video data required—yet delivers state-of-the-art results on benchmarks like DAVIS, YouTube-VOS, and AOT-Bench.

Key innovations, sketched end-to-end after the list below, include:

- Semantic propagation via diffusion self-attention across frames.
- Test-time optimizations for feature robustness.
- SAM for precise mask boundaries post-propagation.
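A minimal sketch of how such a loop might fit together is shown below. It assumes a hypothetical `extract_features` hook that pulls self-attention features from the diffusion UNet, reuses the `propagate_labels` helper sketched earlier, and assumes features and masks share the same spatial resolution. The SAM calls use the public `segment_anything` predictor API, but the overall structure is an assumption, not the authors' exact pipeline.

```python
# Hedged end-to-end sketch of a DRIFT-style loop: propagate the previous mask
# with diffusion attention features, then refine it with SAM.
# `extract_features` is a hypothetical hook into the diffusion UNet, and
# `propagate_labels` is the helper from the earlier sketch.

import numpy as np
import torch
from segment_anything import SamPredictor, sam_model_registry

def mask_to_box(mask):
    """Tight bounding box (x_min, y_min, x_max, y_max) around a binary mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

def track(frames, init_mask, sam_checkpoint="sam_vit_h.pth"):
    """frames: list of HxWx3 uint8 RGB arrays; init_mask: HxW boolean array."""
    predictor = SamPredictor(sam_model_registry["vit_h"](checkpoint=sam_checkpoint))
    masks, prev_mask = [init_mask], init_mask
    prev_feat = extract_features(frames[0])   # hypothetical diffusion-feature hook
    for frame in frames[1:]:
        cur_feat = extract_features(frame)
        # Coarse mask from attention-style propagation (see earlier sketch).
        coarse = propagate_labels(prev_feat, cur_feat,
                                  torch.as_tensor(prev_mask, dtype=torch.long))
        coarse = coarse.cpu().numpy().astype(bool)
        # Refine the propagated mask with SAM, prompted by its bounding box.
        predictor.set_image(frame)
        refined, _, _ = predictor.predict(box=mask_to_box(coarse),
                                          multimask_output=False)
        prev_mask, prev_feat = refined[0], cur_feat
        masks.append(prev_mask)
    return masks
```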

The framework's success underscores a broader trend: pretrained foundation models in vision are evolving into versatile tools for perception tasks, rivaling specialized architectures.

Implications for Computer Vision Research

For developers and researchers in computer vision, DRIFT opens new avenues. Traditional trackers rely on supervised learning or optical flow, often struggling with generalization. Diffusion models, trained on vast image datasets like LAION, bring zero-shot generalization from the outset. This could democratize advanced tracking for applications in autonomous driving, surveillance, and AR/VR, where labeled video data is scarce.

However, challenges remain. Computational overhead from iterative diffusion sampling and test-time optimizations may limit real-time deployment. Future work might explore distilled versions or integration with efficient samplers like LCM (Latent Consistency Models).
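As a purely illustrative pointer for that last direction, swapping in a Latent Consistency scheduler in the `diffusers` library looks roughly like the snippet below. This is not part of DRIFT; it only shows the kind of few-step sampling setup that future work could explore to cut diffusion overhead.

```python
# Illustrative only: few-step sampling with an LCM-LoRA in `diffusers`.

import torch
from diffusers import StableDiffusionPipeline, LCMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Replace the default sampler with the Latent Consistency scheduler and load
# the matching LCM-LoRA weights, enabling ~4-step inference instead of ~50.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

image = pipe("a street scene", num_inference_steps=4, guidance_scale=1.0).images[0]
```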

As diffusion models continue to surprise, this research signals a shift: from siloed generation to multifaceted vision foundations. DRIFT not only tracks objects with unprecedented zero-shot fidelity but also invites the community to probe deeper into the latent capabilities of these probabilistic powerhouses.