Developer builds local AI system that indexes video archives using Gemma 4, making years of unorganized footage searchable in plain English without cloud processing.
When you spend half your year in the Maasai Mara capturing footage from iPhones, drones, and professional cameras, and the other half in Silicon Valley as a developer, you end up with a video problem. That's the situation NJ found themselves in, with years of raw footage piling up on SSDs, untouched and unedited.
The core issue isn't just the volume of video—it's the inability to find specific moments in an unlabeled archive. "Every photographer or videographer I know is sitting on the same problem: an archive that grows faster than they can edit it," NJ explains. "The second half is why mine never gets touched."
Instead of turning to existing AI video editing tools that assume pre-labeled footage, NJ built a custom solution that creates an index of video content locally. The result is a system that can search through a year of video footage using plain English queries, all running on a five-year-old MacBook Pro.
The Problem with Existing Solutions
The initial approach involved a SaaS stack combining Eddie AI for iterative editing, Higgsfield MCP for generative B-roll, Submagic for captions, and Buffer for cross-posting. This would have cost around $140 monthly but presented two immediate issues:
"Generative AI video has no place on a real travel brand," NJ notes. "Guests pay $300 a night and up to see the actual place, and mislabeled AI shots equals TripAdvisor crucifixion."
The tools couldn't solve the fundamental problem: finding content in an unlabeled archive. "Every AI video editor on the market assumes your footage is already labeled. Mine is IMG_*.mov and DJI_*.mp4 across folders with names like Mara june 2024 backup final FINAL."
The Local Indexing Solution
What NJ built instead is a local-first system that creates .description.md "sidecar" files for each video clip, living right next to the original footage. This approach has several advantages:
- Privacy: No need to upload thousands of multi-gigabyte clips to the cloud
- Portability: Sidecar files travel with the data when files move between drives
- Resilience: The system survives if the indexer breaks
- Comprehensive indexing: Captures everything in one vision pass rather than requiring multiple processing steps
The per-clip pipeline involves several components working together:
- ffprobe for basic metadata extraction
- exiftool for GPS coordinates (works on iPhone, DJI Pocket, drone footage)
- Nominatim for reverse geocoding
- ffmpeg extracts five evenly-spaced frames at 1920px resolution
- WhisperX for transcription with word-level alignment and speaker diarization
- insightface for face detection and storing 512-dim ArcFace embeddings
- Vision model (Gemma 4) reads frames, transcript, and folder context
The system outputs YAML frontmatter plus a prose description for each clip, creating a rich, searchable index of the entire video archive.
Running on Older Hardware
The most surprising aspect of this project is that it runs effectively on a 5-year-old MacBook Pro M1 Max with 64GB of RAM. NJ bought this laptop in 2021 with no intention of running large language models on it—purely for handling memory-intensive development work.
"LM Studio with Gemma 4 31B Q4 loaded. 28.40 GB of model in memory, REST API at 127.0.0.1:1234," NJ describes the setup. During bulk processing, the system pushed the hardware to its limits, with Activity Monitor reporting 50.89 GB of swap used at peak.
"My laptop ran hot, the fans spun up, and it kept producing sidecars while I worked on other things," NJ recalls. "The M1 Max 16-inch is, honestly, legendary. People in the Mac community talk about it that way for good reason: five years on, it's running 31B-parameter models at usable speed with the kind of headroom that should not exist on hardware this old."
Technical Challenges and Solutions
Building the system revealed several interesting technical challenges:
WhisperX API changes: The diarization API had breaking changes between versions. The solution was signature introspection that tries both parameter names.
Claude CLI permission handling: The CLI returns permission errors as successful responses. The fix was adding permission flags and defensive checks for permission-denial text.
Schema design issues: Gemma returned "people_count: 'many'" instead of an integer. The solution was stricter prompts plus coercion in the parser.
Content culling logic: Initial criteria were too aggressive, culling handheld nighttime motorcycle clips that had artistic value. The solution was reframing cull criteria to "not a real recording" only.
Key Insights
Through this project, NJ developed several important insights about AI video processing:
Enum constraints beat instructions for confabulation prevention: "A model can lie about open-ended prose, but it can only mis-pick from an enum, never invent a new value."
Local 31B with structured prompts closes most of the gap to cloud: "Gemma 4 31B Q4 thinking-off against a structured schema produces output that's hard to distinguish from Sonnet 4.6 on most of my test clips."
AI video editors are pitched one layer too high: "The valuable layer is the index. Once your archive is queryable in plain English, the editor on top is straightforward."
Future Plans
The indexing system is complete, but NJ is now building an editor that leverages this index. "This weekend I'm building the editor: Claude Code as the orchestrator, DaVinci Resolve MCP for the cuts, ElevenLabs for voiceover on informational clips."
There's one important ethical constraint: "The voice clone is for utility content only. Directions, room descriptions, multilingual versions, factual stuff I'd say in person anyway. Never for testimonials or founder messages."
The code for this project has been open-sourced at github.com/Simbastack-hq/framedex, with NJ inviting contributions and feedback.

This project represents an interesting approach to solving a real problem for content creators: the overwhelming backlog of personal video footage. By leveraging local AI models and thoughtful architecture, NJ has created a system that makes video archives searchable without compromising privacy or requiring expensive cloud processing.
As local AI models continue to improve and hardware becomes more efficient, solutions like this could democratize video content organization for creators, researchers, and archivists who need to work with sensitive or large video collections.

Comments
Please log in or register to join the discussion