Researchers have discovered that image diffusion models, best known for generation tasks, exhibit an emergent ability to propagate semantic labels across frames, enabling zero-shot object tracking in videos. The new DRIFT framework leverages this capability alongside SAM-guided mask refinement to achieve state-of-the-art performance on video object segmentation benchmarks, all without any additional training. The result points to untapped potential in pretrained diffusion models for complex vision tasks.
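To make the label-propagation idea concrete, here is a minimal sketch of the core step: given per-pixel features from two frames (extracted elsewhere, e.g. from a diffusion model's intermediate activations), labels on a reference frame are carried to a target frame by matching each target pixel to its most similar reference pixel. The function name and the nearest-neighbor rule are illustrative assumptions, not DRIFT's actual method, and SAM refinement is omitted.

```python
import numpy as np

def propagate_labels(feats_ref, labels_ref, feats_tgt):
    """Propagate per-pixel labels from a reference frame to a target frame
    by cosine-similarity matching of feature vectors.

    feats_ref, feats_tgt: (H, W, C) feature maps, assumed to come from a
    pretrained diffusion model's intermediate activations.
    labels_ref: (H, W) integer label map for the reference frame.
    Returns an (H, W) label map for the target frame.
    """
    H, W, C = feats_tgt.shape
    ref = feats_ref.reshape(-1, C)
    tgt = feats_tgt.reshape(-1, C)
    # L2-normalize so the dot product below is cosine similarity.
    ref = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + 1e-8)
    tgt = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + 1e-8)
    sim = tgt @ ref.T                # (H*W, H*W) pairwise similarities
    nn = sim.argmax(axis=1)          # nearest reference pixel per target pixel
    return labels_ref.reshape(-1)[nn].reshape(H, W)
```

In a real pipeline this dense pairwise match would be restricted to a local search window for efficiency, and the propagated mask would then be snapped to object boundaries by a refinement step such as SAM.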