Kunlun Tech's SkyReels-V4 Claims Second Place in Global Video Model Rankings

AI & ML Reporter

Kunlun Tech's SkyReels-V4 has claimed second place globally on Artificial Analysis' text-to-video leaderboard; it is also the world's first video foundation model to support multimodal input, joint audio-video generation, and unified creation and editing.

On February 27, 2026, Kunlun Tech officially released its multimodal video foundation model, SkyReels-V4. The model supports resolutions up to 1080p, frame rates up to 32 fps, and cinema-quality clips of up to 15 seconds, with precise synchronization between audio and video, and it covers the entire one-stop video creation workflow from initial concept to fine-grained editing.

According to the latest standardized test results released by the independent analysis firm Artificial Analysis, SkyReels-V4 ranked second globally among active models on the text-to-video (T2V, including audio) leaderboard and fourth on the all-time list of all T2V models, outperforming current mainstream models such as Veo 3.1, Sora 2, Vidu Q3, and Wan 2.6.

SkyReels-V4 accepts input across multiple modalities, including text, images, and video, making it the world's first video foundation model to simultaneously support multimodal input, joint audio-video generation, and unified generation and editing. Its core advantage is "full-modal reference": the model seamlessly ingests rich instructions in the form of text, images, video clips, masks, and audio references. Creators no longer need to switch between multiple tools; they can complete end-to-end creation, from initial concept to professional-grade synchronized audio-video output, within a single model.

In terms of technical architecture, SkyReels-V4 adopts a symmetric dual-stream MMDiT design that deeply couples audio and video features through bidirectional cross-attention. To handle the mismatch between the temporal resolutions of audio and video, the team introduced frequency scaling for RoPE (rotary position embedding), so that both modalities attend to each other on the same temporal rhythm. The system also uses a unified channel-concatenation framework that reduces a range of complex editing operations to inpainting problems under specific mask configurations, and introduces a trainable Video Sparse Attention (VSA) mechanism that cuts attention computation costs by roughly a factor of three without compromising quality.
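To make the audio-video coupling concrete, here is a minimal PyTorch sketch of how a symmetric block might pair bidirectional cross-attention with RoPE frequency scaling. Everything here is illustrative rather than taken from the SkyReels codebase: the names (`rope_angles`, `apply_rope`, `BidirectionalCrossAttention`) and the token rates are assumptions, and production MMDiT blocks rotate queries and keys per head after projection, a step this simplification skips.

```python
import torch
import torch.nn as nn

def rope_angles(dim: int, positions: torch.Tensor, time_scale: float):
    """Standard RoPE angle table. `time_scale` converts token indices into a
    shared wall-clock timeline, so streams sampled at different rates (audio
    tokens vs. video frames) rotate at the same temporal rhythm."""
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    angles = (positions.float() * time_scale).unsqueeze(-1) * inv_freq  # (L, dim/2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """Rotate adjacent channel pairs of x (B, L, D) by position-dependent angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class BidirectionalCrossAttention(nn.Module):
    """Symmetric block: video attends to audio and audio attends to video,
    with both streams positioned on a shared time axis measured in seconds."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video, audio, video_fps: float, audio_fps: float):
        d = video.shape[-1]
        # 1 / rate maps token index -> seconds, aligning the two timelines.
        v_cos, v_sin = rope_angles(d, torch.arange(video.shape[1]), 1.0 / video_fps)
        a_cos, a_sin = rope_angles(d, torch.arange(audio.shape[1]), 1.0 / audio_fps)
        v = apply_rope(video, v_cos, v_sin)
        a = apply_rope(audio, a_cos, a_sin)
        v_out, _ = self.audio_to_video(v, a, a)  # video queries, audio keys/values
        a_out, _ = self.video_to_audio(a, v, v)  # audio queries, video keys/values
        return video + v_out, audio + a_out

# One second of content in both streams, despite different token rates.
block = BidirectionalCrossAttention(dim=64)
vid = torch.randn(1, 32, 64)   # 32 video tokens at 32 fps
aud = torch.randn(1, 100, 64)  # 100 audio tokens at 100 tokens/sec
vid_out, aud_out = block(vid, aud, video_fps=32.0, audio_fps=100.0)
print(vid_out.shape, aud_out.shape)
```

The editing-as-inpainting unification can be sketched the same way. Under the assumption that the model consumes a channel-concatenated stack of noisy latent, mask, and masked reference (a common formulation, not confirmed SkyReels internals), every edit becomes a fill-in task:

```python
import torch

def build_inpainting_input(latent, reference, mask):
    """Hypothetical channel concatenation for unified editing-as-inpainting.
    Shapes: (B, C, T, H, W) latents, (B, 1, T, H, W) mask with 1 = regenerate.
    The model sees [noisy latent | mask | masked reference] and learns to
    fill only the masked region, so different edits differ only in the mask."""
    masked_ref = reference * (1.0 - mask)                 # keep preserved pixels
    return torch.cat([latent, mask, masked_ref], dim=1)   # channels: C + 1 + C
```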

The Kunlun Tech team adopted a multi-stage progressive training paradigm, starting with 256px text-to-image pre-training and gradually scaling up to mixed multi-resolution training at 480px, 720px, and 1080p. In the final supervised fine-tuning stage, the team used 5 million multimodal video samples, refined with 1 million manually curated high-quality videos.
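As a rough illustration, the schedule described above could be written down as a stage table like the following Python sketch; the stage names and field layout are assumptions, and only the resolutions and the supervised fine-tuning data volumes come from the article.

```python
from dataclasses import dataclass

@dataclass
class TrainingStage:
    name: str
    resolutions: list[int]   # pixel sizes mixed within the stage
    data: str                # data mix, as described in the article

# Hypothetical reconstruction of the progressive schedule; stage names and
# the SFT resolution mix are illustrative, not published SkyReels settings.
SCHEDULE = [
    TrainingStage("t2i_pretrain", [256], "text-image pairs"),
    TrainingStage("multires_video", [480, 720, 1080], "mixed-resolution video"),
    TrainingStage("sft", [720, 1080],
                  "5M multimodal video samples + 1M curated high-quality videos"),
]

for stage in SCHEDULE:
    print(f"{stage.name}: resolutions={stage.resolutions} | data={stage.data}")
```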

Across the Kunlun Tech AI ecosystem, four major model families are now in place: the Skywork series of large models, the Mureka music and audio models, the SkyReels video models, and the Matrix Game world models. The newly released SkyReels-V4 fills a key gap in this ecosystem, full-modal audio-visual content production, with planned future support for video generation beyond 60 seconds, real-time interactive editing, and an open API integrated across the full product line.

Source: Minds in AI
Tags: #SkyReels #KunlunTech
