AutoShorts: An Open-Source Tool for AI-Powered Vertical Video Generation from Gameplay Footage
#AI

AI & ML Reporter

A new open-source project called AutoShorts automates the creation of vertical short clips from long-form gameplay videos using AI scene analysis and GPU-accelerated rendering. The tool identifies engaging moments, crops footage to 9:16 aspect ratio, and can add subtitles or AI voiceovers, targeting content creators who need to repurpose gameplay for platforms like TikTok and YouTube Shorts.

The challenge for many gaming content creators is the labor-intensive process of editing long gameplay sessions into short, engaging clips suitable for vertical platforms. Manually scrubbing through hours of footage to find highlights, then cropping the picks and adding subtitles, is time-consuming. An open-source project named AutoShorts aims to automate this pipeline using AI and GPU acceleration.


What's Claimed

AutoShorts is presented as a system that automatically generates "viral-ready" vertical short clips from long-form gameplay footage. The core promise is to use AI to analyze videos and identify the most engaging moments—such as action sequences, funny fails, or clutch achievements—then automatically crop, render, and add subtitles or AI voiceovers to create content ready for upload.

The project highlights several key components:

  • AI-Powered Scene Analysis: Support for multiple providers (OpenAI's GPT models and Google Gemini) with different analysis modes (action, funny, highlight, or mixed).
  • Subtitle Generation: Two modes are available: speech transcription using OpenAI Whisper, or AI-generated contextual captions for gameplay without voice commentary. Multiple caption styles are included, and the tool integrates with PyCaps for visual templates.
  • AI Voiceover: A local text-to-speech system called ChatterBox TTS that runs without cloud APIs, offering emotion control and support for over 20 languages.
  • GPU-Accelerated Pipeline: Utilizes CUDA for scene detection, audio analysis, and rendering, with a fallback to CPU-based methods if GPU components fail.
  • Smart Video Processing: Ranks scenes by a combined audio and video action score, with configurable aspect ratios (default 9:16) and smart cropping for non-vertical footage.
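
To make the smart-cropping idea concrete, the sketch below shows one common way to turn widescreen footage into a 9:16 clip with a blurred-background fill, driving FFmpeg from Python. It is a generic recipe under assumed parameters (a 1080x1920 canvas, a fixed blur radius) and illustrative file paths, not AutoShorts' actual implementation.

```python
import subprocess

def crop_to_vertical(src: str, dst: str, width: int = 1080, height: int = 1920) -> None:
    """Illustrative 9:16 conversion: blurred, scaled-to-fill background with the
    original footage overlaid at full width (not the project's own code)."""
    filter_graph = (
        # Duplicate the video stream: one copy for the background, one for the foreground.
        "[0:v]split=2[bg][fg];"
        # Background: scale to cover the vertical canvas, crop the overflow, blur heavily.
        f"[bg]scale={width}:{height}:force_original_aspect_ratio=increase,"
        f"crop={width}:{height},boxblur=20[bgb];"
        # Foreground: scale the original to the target width, preserving aspect ratio.
        f"[fg]scale={width}:-2[fgs];"
        # Composite the sharp foreground over the blurred background, centered.
        "[bgb][fgs]overlay=(W-w)/2:(H-h)/2"
    )
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-filter_complex", filter_graph, "-c:a", "copy", dst],
        check=True,
    )

# Example usage; paths are illustrative.
crop_to_vertical("gameplay/session.mp4", "generated/session_vertical.mp4")
```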

What's Actually New

AutoShorts is not the first tool to attempt automated short-form video creation from long footage. It builds upon existing concepts and open-source work, as noted in its acknowledgments to projects like artryazanov/shorts-maker-gpu and Binary-Bytes/Auto-YouTube-Shorts-Maker. The novelty lies in its specific integration and feature set:

  1. Modular AI Provider Support: Unlike tools tied to a single AI service, AutoShorts allows users to choose between OpenAI and Google Gemini for scene analysis, or fall back to a local heuristic scoring system. This flexibility is practical for cost management and reliability.

  2. Local, GPU-Accelerated TTS: The inclusion of a local TTS engine (ChatterBox) is significant. Many automated video tools rely on cloud-based voice synthesis APIs, which incur ongoing costs and latency. Running TTS locally on a GPU, with emotion and multilingual controls, offers more control and privacy, though it requires specific hardware.

  3. Robust Fallback System: The project is designed with failure in mind. It explicitly defines fallback paths for every major component: NVENC (GPU) to libx264 (CPU) for encoding, PyCaps to FFmpeg burn-in for subtitles, OpenAI/Gemini to local heuristics for AI analysis, and CUDA to CPU for TTS (a simplified sketch of this pattern follows the list). This makes the tool more usable in environments where GPU availability or API access is inconsistent.

  4. Integrated Development and Deployment: The project provides two clear installation paths: a Makefile-based installer that handles environment setup and building Decord (a video processing library) with CUDA support, and a Docker container setup. This lowers the barrier to entry for users who may not be familiar with complex Python environment management.
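
The fallback behavior described in point 3 boils down to a "try the fast path, degrade gracefully" pattern. The code below is an illustrative reimplementation, not taken from the repository; `provider.score()`, `audio_score`, and `video_score` are placeholder names, and only the 0.6/0.4 weights come from the project's description.

```python
import logging
import subprocess

log = logging.getLogger("autoshorts-sketch")

def encode(src: str, dst: str) -> None:
    """Try NVENC hardware encoding first, then fall back to CPU libx264 (illustrative)."""
    for codec in ("h264_nvenc", "libx264"):
        try:
            subprocess.run(
                ["ffmpeg", "-y", "-i", src, "-c:v", codec, "-c:a", "copy", dst],
                check=True,
            )
            return
        except subprocess.CalledProcessError:
            log.warning("encoder %s failed, trying the next fallback", codec)
    raise RuntimeError("all encoders failed")

def score_scenes(scenes, provider=None):
    """Prefer a cloud LLM for semantic scoring, fall back to a local heuristic."""
    if provider is not None:
        try:
            return provider.score(scenes)        # hypothetical OpenAI/Gemini wrapper
        except Exception as exc:                 # network, quota, auth failures, ...
            log.warning("AI provider failed (%s); using local heuristic", exc)
    # Local heuristic: weighted mix of audio and video activity (weights from the article).
    return [(s, 0.6 * s.audio_score + 0.4 * s.video_score) for s in scenes]
```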

Limitations and Practical Considerations

While AutoShorts automates a complex workflow, its effectiveness is tied to several practical constraints and inherent limitations of the approach.

Hardware Requirements Are Strict: The tool is built around NVIDIA GPUs with CUDA support. An RTX-series card is recommended, and a CUDA-capable GPU is needed for both NVENC hardware encoding and the local TTS engine. Without one, the system falls back to CPU processing, which is significantly slower. The installation process also requires building Decord from source with CUDA enabled, which can be a point of friction for users without a development environment.

AI Analysis is Not Foolproof: The AI scene analysis, whether using GPT models or Gemini, is fundamentally a classification task. It will identify moments based on patterns it has learned, but it cannot understand context like a human editor. A "funny" moment might be missed if it's subtle, or an "action" sequence might be over-selected if the game has constant combat. The "mixed" mode attempts to auto-detect the best category, but its accuracy depends on the quality of the underlying model and the specificity of the gameplay. The local heuristic fallback is a practical alternative but likely less nuanced.

Voiceover and Subtitles Add Layers of Complexity: The AI voiceover feature, while powerful, introduces a new creative variable. The emotion control and multilingual support are impressive for a local tool, but the output quality will depend on the TTS model's training data. Similarly, AI-generated captions for gameplay without voice commentary are an interpretive task—summarizing visual events into text—which can be error-prone. The tool offers styles like "gaming" or "dramatic," but these are stylistic overlays on top of the core AI interpretation.

The "Viral-Ready" Promise is Subjective: The term "viral-ready" is a marketing claim, not a technical guarantee. The tool automates the format of viral content (short, vertical, fast-paced) but cannot guarantee engagement or views. The quality of the output is directly dependent on the quality of the input footage and the user's configuration choices. A poorly recorded gameplay session will still yield a poorly generated short.

How It Works: The Pipeline

For those interested in the technical flow, AutoShorts follows a structured pipeline:

  1. Input & Analysis: The user places source videos in a gameplay/ directory. The system uses decord (a video reader) with PyTorch on the GPU to stream and analyze video frames. Audio is processed with torchaudio to calculate RMS and spectral flux, identifying loud or dynamic moments.
  2. Scene Scoring: A combined score is calculated for potential clips, weighting audio (0.6) and video (0.4) signals. The AI provider (if enabled) adds a semantic score based on the chosen goal (action, funny, etc.). The highest-scoring segments are selected as candidates (a simplified scoring sketch follows this list).
  3. Processing & Generation: For each candidate clip:
    • Cropping: The video is cropped to the target aspect ratio (e.g., 9:16). If the source isn't vertical, it can apply a blurred background fill.
    • Subtitles: Depending on the mode, Whisper transcribes existing audio, or an AI model generates contextual captions. PyCaps or FFmpeg overlays the text.
    • Voiceover: If enabled, the ChatterBox TTS engine generates audio from a text prompt, which is then mixed with the game audio (with automatic ducking).
  4. Rendering: The final video is rendered using PyTorch and the NVENC hardware encoder for speed. If NVENC fails, it falls back to the CPU-based libx264 encoder.
  5. Output: Generated clips, subtitle data, and logs are saved to a generated/ directory.
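
As a rough illustration of the scoring step referenced above, the sketch below derives a per-window audio activity signal from RMS loudness and spectral flux, then mixes audio and video scores with the 0.6/0.4 weights the project describes. The feature definitions and normalization are assumptions for illustration, not the project's exact code.

```python
import torch
import torchaudio

def _norm(x: torch.Tensor) -> torch.Tensor:
    """Min-max normalize a 1-D signal to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

def audio_activity(path: str, hop: int = 2048) -> torch.Tensor:
    """Per-window audio 'action' score built from RMS loudness and spectral flux."""
    wav, _sr = torchaudio.load(path)
    mono = wav.mean(dim=0)
    frames = mono.unfold(0, hop, hop)                         # (n_windows, hop)
    rms = frames.pow(2).mean(dim=1).sqrt()                    # loudness per window
    spec = torch.stft(mono, n_fft=hop, hop_length=hop,
                      window=torch.hann_window(hop),
                      return_complex=True).abs()              # (freq, time)
    flux = (spec[:, 1:] - spec[:, :-1]).clamp(min=0).sum(0)   # onset-like spectral change
    n = min(rms.shape[0], flux.shape[0])
    return 0.5 * _norm(rms[:n]) + 0.5 * _norm(flux[:n])

def combined_score(audio_score: float, video_score: float) -> float:
    """Audio/video weighting as described in the article (0.6 / 0.4)."""
    return 0.6 * audio_score + 0.4 * video_score
```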

Configuration and Customization

Users have significant control through a .env configuration file. Key parameters include:

  • AI Provider & Model: AI_PROVIDER=openai or gemini, with OPENAI_MODEL set to specific versions like gpt-5-mini.
  • Semantic Goal: SEMANTIC_GOAL=mixed for auto-detection, or a specific category.
  • Subtitle Mode: SUBTITLE_MODE=speech for transcription, ai_captions for generated text, or none.
  • TTS Settings: TTS_LANGUAGE (e.g., en, ja), TTS_EMOTION_LEVEL for intensity.
  • Video Output: TARGET_RATIO_W/H=9/16, SCENE_LIMIT for maximum clips per source.

For development and testing, debug variables like DEBUG_SKIP_ANALYSIS and DEBUG_SKIP_RENDER allow users to isolate parts of the pipeline without running the full, computationally expensive process.
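
Taken together, a minimal .env might look like the sketch below. The variable names are those listed above; the concrete values, and the exact scales expected for TTS_EMOTION_LEVEL and SCENE_LIMIT, are assumptions rather than documented defaults.

```env
# AI provider and analysis goal
AI_PROVIDER=openai
OPENAI_MODEL=gpt-5-mini
SEMANTIC_GOAL=mixed

# Subtitles and voiceover
SUBTITLE_MODE=speech
TTS_LANGUAGE=en
TTS_EMOTION_LEVEL=0.7        # assumed 0-1 scale; check the project docs

# Output format
TARGET_RATIO_W=9
TARGET_RATIO_H=16
SCENE_LIMIT=5                # assumed value: maximum clips per source video

# Debug toggles for development
DEBUG_SKIP_ANALYSIS=false
DEBUG_SKIP_RENDER=false
```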

Conclusion

AutoShorts represents a practical, open-source implementation of an automated video editing pipeline tailored for gamers. Its strength is in its modularity and robust fallback systems, making it adaptable to different hardware and API availability scenarios. However, it is not a magic bullet. It requires a compatible NVIDIA GPU, careful configuration, and an understanding that AI-driven analysis and generation have inherent limitations. For creators with extensive gameplay libraries and the technical willingness to set up a local GPU environment, it offers a powerful way to streamline content repurposing. For others, the hardware and setup barriers may be significant. The project's value is best realized by those who can treat it as a configurable tool rather than a fully autonomous solution.

