Xiaomi's SVOR (Stable Video Object Removal) framework won the CVPR 2026 Physical Perception Video Instance Removal Challenge by solving three persistent problems in video object removal: shadow residue, motion jitter, and mask defects. The open-source framework introduces novel approaches to make video object removal more practical for real-world applications.
Xiaomi's model application team has unveiled SVOR (Stable Video Object Removal), a framework that secured first place at the CVPR 2026 Physical Perception Video Instance Removal Challenge, outperforming 17 competing teams. The project addresses three fundamental limitations that have plagued previous video object removal techniques, making it a significant step toward practical applications in consumer video editing.
What SVOR Claims to Solve
Video object removal has long been a challenging problem in computer vision. While existing methods demonstrate impressive results in controlled laboratory conditions, they consistently fail when applied to real-world videos. Three specific issues have been particularly resistant to improvement:
- Shadow residue: When an object casting a shadow is removed, the shadow often remains, creating an unnatural visual artifact.
- Motion jitter: Fast-moving subjects exhibit flickering or reappearing artifacts between frames, disrupting temporal consistency.
- Mask defects: User-drawn or AI-generated masks are rarely perfect, causing visible artifacts at object boundaries.

Technical Innovations
SVOR addresses these challenges through three dedicated technical modules:
MUSE (Mask Union for Stable Erasure)
The MUSE module tackles motion jitter by processing objects within temporal windows rather than treating each frame independently. This approach maintains consistency across consecutive frames, eliminating the flickering artifacts that plague fast-moving subjects. By considering temporal context, MUSE can predict and compensate for motion patterns, ensuring smooth object removal even during rapid movement.
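The exact windowing scheme inside MUSE has not been detailed publicly; the core idea of stabilizing the erased region by pooling masks over a temporal window can be sketched as follows (the function name, the default window size, and the use of a simple union are all illustrative assumptions, not the actual module):

```python
import numpy as np

def temporal_mask_union(masks: np.ndarray, window: int = 5) -> np.ndarray:
    """For each frame, take the union of the object masks within a
    centered temporal window, so the erased region stays stable even
    when the per-frame mask flickers on a fast-moving subject.

    masks: bool array of shape (num_frames, H, W).
    """
    n = len(masks)
    half = window // 2
    out = np.zeros_like(masks)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        out[t] = masks[lo:hi].any(axis=0)  # union over the window
    return out
```

With a window of 3, a frame whose mask momentarily drops out inherits the object region from its neighbors, which is one plausible way to suppress the "reappearing object" artifact the article describes.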
DA-Seg (Denoising-Aware Segmentation)
DA-Seg provides error-correction for imperfect masks, addressing the common issue of inaccurate object boundaries. The module implements a denoising-aware approach that can stabilize the segmentation mask even when the initial user-provided or AI-generated mask contains errors. This makes the framework more robust to real-world input conditions, where perfect segmentation is rarely achievable.
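The real DA-Seg is a learned, denoising-aware module; as a rough intuition for what mask error-correction does, here is a classical stand-in that cleans a binary mask with a 3x3 majority vote (the function, the neighborhood size, and the threshold of 5 are assumptions for illustration only):

```python
import numpy as np

def denoise_mask(mask: np.ndarray) -> np.ndarray:
    """3x3 majority vote over a binary mask: fills single-pixel
    holes and drops isolated speckle noise. A cheap heuristic
    stand-in for DA-Seg's learned mask correction.

    mask: bool array of shape (H, W).
    """
    h, w = mask.shape
    padded = np.pad(mask.astype(np.int32), 1)  # zero-pad the border
    # Sum each pixel's 3x3 neighborhood via nine shifted views.
    counts = sum(padded[i:i + h, j:j + w]
                 for i in range(3) for j in range(3))
    return counts >= 5  # keep a pixel if a majority of its 3x3 cell is set
```

A pinhole inside a solid region is surrounded by eight set pixels and gets filled; an isolated stray pixel has a neighborhood count of one and is dropped.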
Two-Stage Training Approach
SVOR employs a curriculum learning strategy across two training stages. The initial stage focuses on learning basic object removal skills with high-quality inputs, while the second stage introduces increasingly challenging scenarios with imperfect masks and complex motion patterns. This approach gives the model strong cross-scenario generalization capabilities, allowing it to perform consistently across diverse video content.
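The published materials do not spell out the exact schedule; a minimal sketch of such a two-stage curriculum might ramp the severity of synthetic mask corruption only in the second stage (the function names, the linear ramp, and the 0.3 flip-rate cap are all hypothetical choices):

```python
import random

def corruption_severity(stage: int, step: int, steps_per_stage: int) -> float:
    """Stage 1 trains on clean, high-quality masks only; stage 2
    ramps corruption linearly from 0 to 1 over its steps.
    (Illustrative schedule, not the actual SVOR recipe.)"""
    if stage == 1:
        return 0.0
    return min(1.0, step / steps_per_stage)

def corrupt_mask(mask, severity, rng):
    """Randomly flip a fraction of mask pixels to simulate the
    imperfect user- or AI-generated masks seen at inference time."""
    return [[(1 - v) if rng.random() < 0.3 * severity else v
             for v in row] for row in mask]
```

At severity 0 the mask passes through unchanged, so stage 1 sees only clean inputs, while late in stage 2 up to 30% of pixels may be flipped, forcing the model to tolerate the boundary errors described above.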
Benchmark Results and Practical Applications
On standard benchmarks, SVOR achieves state-of-the-art results according to the team's reported evaluations, with the largest gains on the three target problems. The framework demonstrates substantially higher tolerance for imperfect conditions than existing methods, narrowing the gap between research benchmarks and practical consumer applications.
The practical implications are notable. Unlike previous approaches that required carefully controlled conditions to function effectively, SVOR can handle real-world videos with varying lighting conditions, complex backgrounds, and imperfect user inputs. This makes it genuinely viable for integration into consumer video editing software, where robustness to real-world conditions is essential.
Limitations and Future Directions
Despite its impressive performance, SVOR has limitations that should be acknowledged. The framework currently performs best on single object removal scenarios. Handling multiple overlapping objects with complex interactions remains challenging. Additionally, while it improves mask robustness, extremely poor initial masks can still degrade performance.
The computational requirements also suggest that real-time implementation may be limited to high-end hardware, potentially restricting accessibility for consumer applications. Future work will likely focus on optimizing the model for edge deployment and extending its capabilities to more complex multi-object scenarios.
Open-Source Impact
Xiaomi has released the SVOR code under an Apache 2.0 license, making it freely available for researchers and developers to build upon. The release includes pre-trained models, training scripts, and inference code, lowering the barrier to entry for video object removal research.
For those interested in exploring or implementing SVOR, the official GitHub repository provides comprehensive documentation and example code. The framework's modular design also allows researchers to experiment with individual components, potentially accelerating innovation in related areas.
The SVOR framework represents a meaningful advancement in practical video object removal, demonstrating that academic research can produce results that translate well to real-world applications. By addressing the specific pain points that have limited previous methods, Xiaomi has contributed a tool that may significantly impact consumer video editing capabilities.
