Researchers at EPFL's VITA lab have developed Stable Video Infinity, an open-source AI system that overcomes the 'drift' problem to generate coherent videos lasting several minutes.

For years, generative video AI has been constrained by a frustrating limitation: even the most advanced models, such as Pika and RunwayML, produce clips of only 5 to 20 seconds; pushed any longer, their output descends into visual chaos. This near-universal constraint stems from a phenomenon called 'drift,' in which the AI gradually loses consistency in character features and scene elements from frame to frame until the result becomes incoherent. EPFL's Visual Intelligence for Transportation (VITA) laboratory has now engineered a solution that fundamentally changes this landscape.
The breakthrough centers on a novel training methodology named 'retraining by error recycling.' Unlike conventional approaches that discard distorted outputs, this technique intentionally reintroduces flawed generations back into the training data. Professor Alexandre Alahi explains the concept: 'Imagine training pilots exclusively in perfect conditions, then expecting them to handle storms. Our method trains AI in turbulent visual weather.' By forcing the model to continuously learn from its own accumulating errors—such as morphing objects or unstable backgrounds—the system develops inherent resilience against degradation.
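To make the idea concrete, the loop below is a minimal, self-contained Python sketch of error recycling in general terms, not the team's released code: a toy model is sometimes conditioned on its own earlier, imperfect outputs instead of clean data, so it trains under the same 'turbulent' conditions it will face at generation time. The names here (ToyModel, error_buffer, recycle_prob) are illustrative placeholders.

    import random

    # Illustrative sketch of "error recycling" (not the SVI codebase): the
    # model's own imperfect outputs are fed back as training context, so it
    # learns to cope with the kind of drift it produces at inference time.

    class ToyModel:
        """A one-parameter stand-in for a video generator."""
        def __init__(self):
            self.weight = 0.5

        def predict(self, context):
            return self.weight * context

        def update(self, context, target, lr=0.01):
            # One gradient step on squared error between prediction and target.
            error = self.predict(context) - target
            self.weight -= lr * 2 * error * context

    def train(model, steps=2000, recycle_prob=0.5):
        error_buffer = []  # the model's own (possibly flawed) past outputs
        for _ in range(steps):
            clean_context, target = 1.0, 1.0  # toy "ground truth" pair
            # Error recycling: sometimes condition on a recycled flawed output
            # instead of the clean context, mimicking accumulated drift.
            if error_buffer and random.random() < recycle_prob:
                context = random.choice(error_buffer)
            else:
                context = clean_context
            model.update(context, target)
            error_buffer.append(model.predict(context))  # recycle this output

    train(ToyModel())

In SVI's setting the recycled inputs are the model's own drifted video continuations rather than single numbers, but the underlying idea, training on one's own accumulating errors, is the same.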
This approach powers Stable Video Infinity (SVI), which maintains coherence across video sequences lasting several minutes. Benchmarks show SVI preserving character consistency and scene stability where competing models falter within 30 seconds. In one example, a generated 90-second sequence of a walking animal keeps limb proportions and gait rhythm intact throughout, while existing tools exhibit distorted anatomy by the 25-second mark.
The implementation builds on a diffusion model architecture adapted for temporal consistency. Crucially, SVI operates without resource-intensive additions such as memory banks, achieving efficiency through its error-recycling pipeline. The team complements this with LayerSync, a synchronization technique that aligns the model's internal representations across video, image, and audio domains. This cross-modal correction allows coordinated edits, such as modifying an object's texture in one frame and having the change propagate accurately through subsequent frames.
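As a rough illustration of what 'aligning internal representations' can mean in practice, the sketch below shows a generic auxiliary alignment loss that pulls intermediate features from two branches of a model toward each other. This is a common mechanism, not the published LayerSync formulation; the feature names and dimensions are assumptions.

    import numpy as np

    # Generic cross-representation alignment term (illustrative only; the
    # actual LayerSync method may be formulated differently). An auxiliary
    # loss pulls intermediate features from two branches toward a shared
    # representation.

    def cosine_alignment_loss(feat_a, feat_b, eps=1e-8):
        """Return 1 - cosine similarity between two feature vectors."""
        a = feat_a / (np.linalg.norm(feat_a) + eps)
        b = feat_b / (np.linalg.norm(feat_b) + eps)
        return float(1.0 - np.dot(a, b))

    # Toy intermediate features from two hypothetical branches
    # (e.g. a video layer and an audio layer at the same depth).
    video_features = np.random.randn(256)
    audio_features = np.random.randn(256)

    # During training this term would be added to the main generation loss,
    # nudging both branches toward consistent internal representations.
    print(f"alignment loss: {cosine_alignment_loss(video_features, audio_features):.3f}")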
Released as open source on GitHub, SVI garnered over 2,000 stars within weeks. The model generates video at resolutions up to 1024x576 at 30 frames per second, and the research has been accepted at the 2026 International Conference on Learning Representations (ICLR), underscoring its academic significance.
Practical applications extend beyond creative media. Autonomous vehicle simulations require long-duration, physically consistent scenarios, something drift-prone video models could not previously provide. Similarly, educational content generation and film pre-visualization benefit from extended coherent sequences. By tackling drift at the algorithmic level, SVI provides a foundational shift toward reliable long-form generative media.
While SVI sets new benchmarks, limitations remain. Longer generations still exhibit minor flicker artifacts, and computational demands scale with duration. However, its error-recycling paradigm establishes a scalable framework for future improvements. As generative video evolves beyond fleeting clips, EPFL's work marks a critical step toward AI that sustains narrative coherence.
