Overworld has released Waypoint-1, an open-source video diffusion model designed for real-time interaction, trained from scratch on 10,000 hours of game footage with integrated control inputs. Unlike fine-tuned models that suffer from high latency and limited interactivity, Waypoint-1 processes keyboard, mouse, and text prompts with zero latency, generating frames at up to 60 FPS on consumer hardware via its custom inference library, WorldEngine.
Overworld has introduced Waypoint-1, a real-time interactive video diffusion model that aims to bridge the gap between generative video and playable experiences. The model is trained from scratch on 10,000 hours of diverse video game footage, paired directly with control inputs and text captions. This approach contrasts with the common practice of fine-tuning pre-trained video models with simplified controls, which often results in high latency and limited interactivity.

What's Actually New
Waypoint-1's core innovation is its training methodology. The backbone is a frame-causal rectified flow transformer. During pre-training, the model uses diffusion forcing, where each frame is noised randomly and the model learns to denoise them separately under a causal attention mask. This allows for frame-by-frame generation during inference. However, diffusion forcing has a known limitation: the training and inference regimes don't perfectly match, leading to error accumulation over long rollouts.
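The key idea in diffusion forcing is that every frame in a training sequence gets its own independently sampled noise level, so the model learns to denoise any frame at any corruption level given (causally attended) context. A minimal sketch of the noising step, assuming a rectified-flow-style interpolation between clean frames and Gaussian noise (the exact schedule is an assumption, not Waypoint-1's published recipe):

```python
import numpy as np

def diffusion_forcing_noise(frames, rng):
    """Noise each frame with its own independently sampled level.

    frames: (T, D) array of per-frame latents. The model would see all T
    noised frames at once under a causal attention mask, but because each
    frame t carries its own sigma_t, inference can denoise frame-by-frame.
    """
    T, D = frames.shape
    # Independent noise level per frame (assumed uniform schedule).
    sigmas = rng.uniform(0.0, 1.0, size=T)
    noise = rng.standard_normal((T, D))
    # Rectified-flow interpolation: x_sigma = (1 - sigma) * x0 + sigma * eps
    noised = (1.0 - sigmas)[:, None] * frames + sigmas[:, None] * noise
    return noised, sigmas, noise
```

Because the per-frame levels are independent, a single training batch covers many (context noise, target noise) combinations, which is what later lets inference treat each new frame as its own small denoising problem.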
To address this, Overworld post-trains the model with self-forcing, a technique that aligns training with the autoregressive rollout used at inference: the model learns to denoise new frames conditioned on its own previous generations rather than on ground-truth frames. This method, based on Distribution Matching Distillation (DMD), also enables one-pass classifier-free guidance (CFG) and few-step denoising, both of which are critical for real-time performance.
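The training/inference mismatch that self-forcing closes is easiest to see as a rollout loop: each new frame starts from noise and is denoised in a few steps conditioned on the model's own earlier outputs, not on ground truth. A toy sketch (the `step_fn` denoiser and step counts are stand-ins, not Waypoint-1's actual model):

```python
import numpy as np

def self_forcing_rollout(step_fn, context, horizon, num_denoise_steps=2, rng=None):
    """Autoregressive rollout matching the self-forcing training regime.

    step_fn(context_frames, x) performs one denoising pass of a
    (hypothetical) frame-causal model. Unlike teacher forcing, each new
    frame conditions on the model's *own* prior outputs, so training is
    exposed to the same error accumulation that inference produces.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    frames = list(context)
    for _ in range(horizon):
        x = rng.standard_normal(context[0].shape)   # start from pure noise
        for _ in range(num_denoise_steps):          # few-step denoising
            x = step_fn(np.stack(frames), x)
        frames.append(x)                            # feed own output back in
    return np.stack(frames[len(context):])
```

In actual self-forcing training, a distillation loss (DMD) is applied to the frames this loop produces; the sketch only shows the rollout structure.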
The result is a model that can generate new frames conditioned on the user's current controls—mouse movement, keyboard presses, and text prompts—without the latency penalties seen in other systems. Where other models might update the camera position every few frames, Waypoint-1 processes controls with zero latency, making the experience feel immediate.
Practical Applications and Performance
The primary application is interactive world generation for games and simulations. Users can provide an initial set of frames and a text prompt (e.g., "A game where you herd goats in a beautiful valley"), then navigate the generated world in real-time. The model is designed to run on consumer hardware, specifically optimized for NVIDIA GPUs.
Overworld provides the WorldEngine inference library to facilitate this. WorldEngine is a high-performance Python library built for low-latency, high-throughput streaming of interactive worlds. It handles the runtime loop, consuming context frames, controller inputs, and text to output image frames.
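WorldEngine's actual API is not reproduced here, but the runtime loop it implements can be sketched generically: pull the latest controls, generate the next frame from the rolling context, append it, repeat. All names below are invented for illustration and do not correspond to WorldEngine's interface:

```python
# Hypothetical runtime loop; names are illustrative, not WorldEngine's API.
from dataclasses import dataclass
import numpy as np

@dataclass
class Controls:
    mouse_dx: float = 0.0
    mouse_dy: float = 0.0
    keys: frozenset = frozenset()

def run_loop(generate_frame, context_frames, prompt, read_controls, num_frames):
    """Consume context frames, controller inputs, and text; emit image frames."""
    frames = list(context_frames)
    out = []
    for _ in range(num_frames):
        ctrl = read_controls()                     # latest user input, sampled per frame
        frame = generate_frame(frames, ctrl, prompt)
        frames.append(frame)                       # grows the rolling context
        out.append(frame)
    return out
```

The essential property is that controls are read fresh on every iteration, which is what makes per-frame control conditioning (and the "zero latency" behavior described above) possible.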
Performance metrics for Waypoint-1-Small (2.3B parameters) on an NVIDIA RTX 5090 are notable:
- ~30,000 token-passes/sec (single denoising pass; 256 tokens per frame)
- 30 FPS at 4 denoising steps
- 60 FPS at 2 denoising steps
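These three numbers are mutually consistent, assuming each denoising step costs one pass over a frame's 256 tokens:

```python
token_passes_per_sec = 30_000   # reported single-pass throughput
tokens_per_frame = 256

# One full denoising pass over a frame costs tokens_per_frame token-passes.
single_pass_fps = token_passes_per_sec / tokens_per_frame   # ~117 frames/sec

fps_4_steps = single_pass_fps / 4   # ~29, matching the reported 30 FPS
fps_2_steps = single_pass_fps / 2   # ~59, matching the reported 60 FPS
```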
These speeds are achieved through several targeted optimizations in WorldEngine:
- AdaLN Feature Caching: Reuses conditioning projections when prompts and timesteps remain constant between forward passes, avoiding redundant computation.
- Static Rolling KV Cache + Flex Attention Matmul Fusion: Keeps a fixed-size window of past key/value states in a preallocated buffer so attention over prior frames is not recomputed, with attention matmuls fused into a single kernel.
- Torch Compile: Uses `torch.compile(fullgraph=True, mode="max-autotune", dynamic=False)` for graph-level optimization.
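The AdaLN caching idea can be sketched independently of WorldEngine's source: the adaptive-layer-norm scale/shift projections depend only on the conditioning (prompt embedding and timestep), so they can be memoized while those stay constant between forward passes. A minimal sketch, with the cache keying and `project` callable as assumptions:

```python
import numpy as np

class AdaLNCache:
    """Memoize AdaLN conditioning projections across forward passes.

    `project(prompt_emb, timestep)` maps conditioning to (scale, shift).
    When prompt and timestep are unchanged from the previous call, the
    cached result is returned and the projection is skipped entirely.
    """
    def __init__(self, project):
        self.project = project
        self._key = None
        self._value = None

    def __call__(self, prompt_emb, timestep):
        key = (prompt_emb.tobytes(), float(timestep))
        if key != self._key:            # recompute only when inputs change
            self._key = key
            self._value = self.project(prompt_emb, timestep)
        return self._value
```

In a few-step denoising loop with a fixed prompt, the same handful of (prompt, timestep) pairs recur every frame, so nearly all projection work after the first frame is served from the cache.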
Limitations and Trade-offs
While the performance is impressive for a 2.3B parameter model, it's important to contextualize the results. The reported FPS is achieved at 2 or 4 denoising steps, which is a trade-off between speed and generation quality. Fewer steps typically result in lower fidelity or increased artifacts, though self-forcing aims to mitigate this.
The model is trained on 10,000 hours of video game footage. This domain specificity means it will likely perform best for game-like environments and may struggle with photorealistic or non-game video generation. The quality and diversity of the training data are crucial, and the model's ability to generalize to novel concepts outside its training distribution remains an open question.
Furthermore, the "zero latency" claim refers to the control processing. The generation latency itself is tied to the denoising steps and hardware. While 30-60 FPS is suitable for many interactive applications, it's not yet at the level of traditional real-time rendering engines for complex scenes.
Getting Started
The model weights for Waypoint-1-Small are available on the Hugging Face Hub. Overworld is also running a hackathon on January 20, 2026, to encourage development with WorldEngine, offering an RTX 5090 GPU as a prize.
