Kyra Mozley, a Machine Learning Engineer at Wayve, presents a paradigm shift from traditional task-specific computer vision to an embedding-first architecture. By leveraging foundation models like CLIP and SAM, this approach enables auto-labeling, RAG-inspired search, and few-shot adapters to efficiently process petabytes of autonomous driving data, focusing on discovering critical edge cases.

The autonomous driving industry faces a fundamental scaling problem. Traditional computer vision pipelines, built on task-specific models and manual annotation, buckle under the weight of petabyte-scale datasets. Every new perception task, whether detecting a new traffic sign in Germany or identifying a falling cyclist, requires a full loop: data collection, annotation, model training, and deployment. This process is expensive, slow, and inflexible.
In a recent talk at QCon London, Kyra Mozley, a Machine Learning Engineer at Wayve, outlined a new paradigm she calls "Perception 2.0." This architecture moves away from training bespoke models for every task and instead builds on a shared foundation of semantic embeddings. The goal is to unlock insights and enable discovery within vast, unstructured video datasets, with a particular focus on finding rare but critical edge cases.
The Limits of Traditional Perception
For years, autonomous vehicle perception has relied on a standard pipeline:
- Data Collection: Fleets capture multi-camera video, LiDAR, and radar data.
- Filtering & Curation: Engineers select diverse scenes (urban, highway, day, night).
- Manual Annotation: Teams of human annotators draw bounding boxes, segmentation masks, and cuboids. This is the most resource-intensive step, requiring detailed label specifications and contracts with external annotation providers.
- Model Training: Convolutional Neural Networks (CNNs) or Vision Transformers are trained on this labeled data, often requiring distributed GPU clusters for days or weeks.
- Evaluation & Deployment: Models are evaluated on metrics like mean Intersection over Union (mIoU) and average precision before being deployed to the fleet.
- Monitoring & Iteration: When data drift occurs (e.g., new regions, new sensor stacks), the entire pipeline repeats.
This approach has several critical drawbacks:
- Cost & Time: Manual annotation is slow and expensive. Scaling to new tasks or geographies multiplies the effort.
- Rigidity: Models are trained on fixed taxonomies. They cannot detect objects or scenarios outside their predefined classes.
- Failure on Edge Cases: Traditional models often fail on rare events. For example, a segmentation model trained on standard classes might misclassify a person lying on the road as "road furniture." A 3D cuboid detector might lose track of a cyclist who falls.
- Complexity: Managing parallel pipelines for segmentation, detection, depth estimation, and optical flow becomes operationally challenging.
The core issue is that traditional methods reduce a scene to a limited set of structured outputs (boxes, masks) that cannot capture the narrative or semantic richness of what's happening. They see what they've been trained to see, and nothing more.
The Embedding-First Shift: Perception 2.0
The rise of large-scale, multimodal foundation models has enabled a new approach. Instead of training models to detect predefined labels, we can use these pre-trained models to generate rich semantic representations of our data. The key concept is the embedding.
An embedding is a dense numerical vector (often hundreds of dimensions) that captures the semantic meaning of an input—an image, a sentence, or a video clip. Models like CLIP are trained to embed images and text into the same shared space. This means a photo of a cyclist and the text "a person on a bicycle" will have similar vector representations.
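As a concrete illustration of that shared space, the sketch below embeds one camera frame and two text prompts with an open-source CLIP checkpoint and compares them by cosine similarity. It is a minimal example under assumed inputs: the model name, the file frame.jpg, and the prompts are placeholders rather than anything from Wayve's pipeline.

```python
# Minimal sketch: embed one frame and two text prompts with CLIP and
# compare them in the shared image-text space. Model name and file
# path are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("frame.jpg").convert("RGB")  # one camera frame
prompts = ["a person on a bicycle", "an empty motorway at night"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Normalise so the dot product is cosine similarity.
img = torch.nn.functional.normalize(out.image_embeds, dim=-1)  # (1, 512)
txt = torch.nn.functional.normalize(out.text_embeds, dim=-1)   # (2, 512)
print((img @ txt.T).squeeze(0))  # higher score = semantically closer
```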
This is the foundation of Perception 2.0:
- Embed Once: Pass raw video frames through a foundation model (e.g., InternVideo2, CLIP, or an API like GPT-4 Vision) to generate embeddings for the entire dataset.
- Build Flexible Workflows: Store these embeddings in a vector database. Now, you can perform search, clustering, classification, and regression without retraining models or touching raw pixels (a sketch of these first two steps follows this list).
- Leverage Zero-Shot Capabilities: Foundation models can understand complex scenes out of the box. For instance, prompting GPT-4 Vision with a video of a cyclist falling can generate a detailed description: "A woman riding a bicycle suddenly swerves into traffic cones and falls off her bike directly in front of a moving car." This is narrative understanding, not just detection.
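The following sketch covers the embed-once step: batch-embedding decoded frames and persisting the vectors with their frame IDs so they can be indexed later. The directory and output file names are assumptions for illustration.

```python
# Sketch of "embed once": batch-embed decoded frames and persist the
# vectors plus frame IDs for later indexing. Paths are placeholders.
from pathlib import Path
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame_paths = sorted(Path("frames/").glob("*.jpg"))
chunks = []
for start in range(0, len(frame_paths), 64):
    batch = [Image.open(p).convert("RGB") for p in frame_paths[start:start + 64]]
    inputs = processor(images=batch, return_tensors="pt").to(device)
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    chunks.append(torch.nn.functional.normalize(feats, dim=-1).cpu().numpy())

np.savez("frame_embeddings.npz",
         vectors=np.concatenate(chunks),
         ids=np.array([p.name for p in frame_paths]))
```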

Foundation Model Landscape
Choosing the right model depends on your needs. Key dimensions include:
- Access: Open-source models (e.g., CLIP, InternVideo2) offer control and scalability but require GPU infrastructure. API-based models (e.g., OpenAI, Anthropic) are easier to integrate but can be costly at scale and raise data privacy concerns.
- Input Modality: Most models handle images and text, but some, like InternVideo2, are designed for video, capturing temporal dynamics. Others, like ImageBind, support multiple modalities (audio, depth, IMU).
- Output: Most output embeddings, which are crucial for similarity search and clustering. Some also output text, useful for structured scene descriptions or visual question answering (VQA).
The goal is to build a model-agnostic pipeline that allows you to swap components as the landscape evolves.
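One way to achieve that is to hide each backbone behind a narrow interface so that CLIP, InternVideo2, or an API-backed encoder can be swapped without touching downstream workflows. The Protocol below is a hypothetical sketch, not any specific library's API.

```python
# Hypothetical interface for swappable embedding backbones. Downstream
# search, clustering, and adapters depend only on this Protocol.
from typing import Protocol, Sequence
import numpy as np

class Embedder(Protocol):
    dim: int

    def embed_images(self, paths: Sequence[str]) -> np.ndarray:
        """Return an (N, dim) array of L2-normalised image embeddings."""
        ...

    def embed_text(self, queries: Sequence[str]) -> np.ndarray:
        """Return an (M, dim) array of L2-normalised text embeddings."""
        ...

def build_index(embedder: Embedder, frame_paths: Sequence[str]) -> np.ndarray:
    # Any backbone satisfying the Protocol can feed the same index.
    return embedder.embed_images(frame_paths)
```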
Workflows Built on Embeddings
Once you have embeddings, you can build powerful workflows.
1. Semantic Search with RAG-Inspired Techniques
Embedding-based search allows you to query your dataset with natural language. For example, to find scenes where "a person falls off a bicycle," you embed the query and perform a similarity search against your video frame embeddings using a vector database. This retrieves semantically similar scenes without any prior labeling.
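A minimal sketch of that retrieval step, using FAISS as the vector index and CLIP's text encoder for the query. It assumes the frame_embeddings.npz file produced by the earlier embedding sketch; the query string is just an example.

```python
# Sketch: natural-language search over pre-computed frame embeddings.
# Assumes frame_embeddings.npz from the embedding step shown earlier.
import faiss
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

data = np.load("frame_embeddings.npz")
vectors, ids = data["vectors"].astype("float32"), data["ids"]

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine, vectors are normalised
index.add(vectors)

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def search(query: str, k: int = 10):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = torch.nn.functional.normalize(q, dim=-1).numpy().astype("float32")
    scores, idx = index.search(q, k)
    return [(ids[i], float(s)) for i, s in zip(idx[0], scores[0])]

print(search("a person falls off a bicycle"))
```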
However, this has limitations, especially with positional language (e.g., "in front of the car" vs. "behind the car") where the embedding space may not be spatially grounded. To overcome this, Mozley suggests applying techniques from Retrieval-Augmented Generation (RAG):
- Query Rewriting: Use a language model to generate multiple variants of a query (e.g., "cyclist falls in traffic," "biker crashes in road") and search in parallel to improve recall.
- Multi-Query Fusion: Combine the results from multiple query embeddings (by averaging vectors or merging top results) to get a more comprehensive set of matches.
- Reranking: After retrieval, use a cross-modal model to rescore results based on alignment with the original query, prioritizing precision.
This approach makes search robust for real-world, messy queries common in driving data.
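As an example of what the first two techniques might look like in code, the sketch below applies reciprocal-rank fusion (one way of merging top results) over several paraphrases of a query. The paraphrases are hard-coded for illustration; in practice a language model would generate them, and a cross-modal reranker would rescore the fused list.

```python
# Sketch: query rewriting + multi-query fusion. Reciprocal-rank fusion
# rewards frames that several query variants agree on.
from collections import defaultdict
from typing import Callable, List, Tuple

def fused_search(search_fn: Callable[[str, int], List[Tuple[str, float]]],
                 paraphrases: List[str], k: int = 10) -> List[Tuple[str, float]]:
    scores: dict = defaultdict(float)
    for query in paraphrases:
        for rank, (frame_id, _) in enumerate(search_fn(query, k)):
            scores[frame_id] += 1.0 / (rank + 1)  # higher when ranked well by many variants
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Usage with the search() function from the previous sketch:
# fused_search(search, ["a person falls off a bicycle",
#                       "cyclist falls in traffic",
#                       "biker crashes in the road"])
```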
2. Unsupervised Clustering for Discovery
Clustering algorithms (like k-means or DBSCAN) applied to embeddings can automatically discover structure in your data without any labels. This is crucial for finding patterns and anomalies at scale.
- Behavioral Grouping: A cluster might contain frames where the vehicle is following lanes on a straight road. This can be labeled as "lane following" and used for data balancing or evaluation.
- Anomaly Detection: Another cluster might contain frames with poor visual quality (dark images, raindrops, lens glare). Clustering surfaces these outliers automatically, allowing you to filter bad data or investigate sensor issues.
Clustering shifts the workflow from "start with labels, find data" to "start with embeddings, let the data find the labels."
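A minimal clustering sketch over the same pre-computed embeddings, using scikit-learn's k-means. The cluster count is an arbitrary choice for illustration; a density-based method such as DBSCAN is a common alternative when the goal is surfacing outliers.

```python
# Sketch: unsupervised discovery by clustering pre-computed embeddings.
# The cluster count is arbitrary and would be tuned in practice.
import numpy as np
from sklearn.cluster import KMeans

data = np.load("frame_embeddings.npz")
vectors, ids = data["vectors"], data["ids"]

kmeans = KMeans(n_clusters=50, n_init="auto", random_state=0).fit(vectors)

# Inspect a few members per cluster to name it ("lane following",
# "lens glare", ...), then use the assignment for balancing or triage.
for cluster in range(5):
    members = ids[kmeans.labels_ == cluster]
    print(f"cluster {cluster}: {len(members)} frames, e.g. {members[:3]}")
```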
3. Auto-Labeling with Consensus
Auto-labeling uses pre-trained models to generate labels programmatically, bypassing manual annotation.
- Object-Level Labels: Use the Segment Anything Model (SAM) to generate segmentation masks for all objects in an image. Then, pass each segment through CLIP to classify it by comparing its embedding to text prompts (e.g., "car," "pedestrian," "road sign"); a sketch of this combination follows the list.
- Structured Metadata: Prompt a multimodal model like GPT-4 Vision with a JSON schema to generate structured scene descriptions (e.g., weather, road conditions) directly from an image.
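The sketch below pairs SAM's automatic mask generator with CLIP zero-shot classification over each segment's crop. It assumes a locally downloaded SAM checkpoint (the path is a placeholder) and an illustrative prompt list.

```python
# Sketch: auto-label objects by segmenting with SAM, then classifying
# each segment's crop with CLIP against text prompts. The checkpoint
# path and prompt list are placeholders.
import numpy as np
import torch
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
from transformers import CLIPModel, CLIPProcessor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
prompts = ["a car", "a pedestrian", "a road sign", "a traffic cone"]

image = np.array(Image.open("frame.jpg").convert("RGB"))
masks = mask_generator.generate(image)  # list of dicts: "segmentation", "bbox", ...

labels = []
for m in masks:
    x, y, w, h = (int(v) for v in m["bbox"])          # XYWH box around the segment
    crop = Image.fromarray(image[y:y + h, x:x + w])
    inputs = processor(text=prompts, images=crop, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
    labels.append((m["bbox"], prompts[int(probs.argmax())], float(probs.max())))
```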
However, auto-labeling is not perfect. Models can hallucinate or be inconsistent. To ensure quality, use consensus labeling:
- Generate multiple candidate labels using different models or prompts.
- Aggregate the outputs using voting, confidence thresholds, or ensemble rules.
- Flag low-confidence labels for a small amount of manual review.
This makes auto-labeling reliable enough to replace large-scale manual annotation for many tasks.
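A sketch of the aggregation step: simple majority voting over candidate labels, with low-agreement items flagged for manual review. The agreement threshold and label values are illustrative.

```python
# Sketch: consensus labelling by majority vote, flagging low-agreement
# items for manual review. The threshold is an arbitrary example.
from collections import Counter
from typing import List, Optional, Tuple

def consensus(candidates: List[str], min_agreement: float = 0.6) -> Tuple[Optional[str], bool]:
    """Return (label, needs_review) for one item given candidate labels."""
    label, votes = Counter(candidates).most_common(1)[0]
    agreement = votes / len(candidates)
    return (label, False) if agreement >= min_agreement else (None, True)

# Candidates might come from different prompts, models, or frames of the same clip.
print(consensus(["rain", "rain", "overcast"]))   # ('rain', False)
print(consensus(["rain", "overcast", "night"]))  # (None, True) -> send to review
```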
4. Zero-to-Few-Shot Adapters
For safety-critical tasks requiring high precision, you can train lightweight task-specific classifiers or regressors on top of the pre-computed embeddings.
- Few-Shot Classification: Train a small neural network head on top of embeddings using just 10-30 labeled examples per class. Because the embeddings already encode rich semantics, these models generalize well with minimal data. For example, you can train a classifier to detect "failed lane change" scenarios.
- Few-Shot Regression: Similarly, train regression heads to predict continuous values like following distance or time-to-collision. This avoids the need for expensive 3D cuboid annotations and complex geometry pipelines.
This modular approach allows you to add new perception capabilities quickly without retraining massive models from scratch.
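As a sketch of the few-shot classification case, the snippet below trains a linear probe (standing in for the small head) on a handful of labelled embeddings and then scores the whole dataset. The frame IDs and class names are placeholders.

```python
# Sketch: few-shot adapter on top of frozen embeddings. A linear probe
# stands in for the "small head"; frame IDs and classes are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

data = np.load("frame_embeddings.npz")
vectors, ids = data["vectors"], data["ids"]
row_of = {frame_id: row for row, frame_id in enumerate(ids)}

# A small hand-labelled set, e.g. 10-30 examples per class.
labelled = {
    "clip_0001.jpg": "failed_lane_change",
    "clip_0042.jpg": "nominal",
    # ...more examples per class...
}

X = np.stack([vectors[row_of[frame_id]] for frame_id in labelled])
y = list(labelled.values())

head = LogisticRegression(max_iter=1000).fit(X, y)

# Score every embedded frame in one pass; the backbone is never retrained.
predictions = head.predict(vectors)
```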
The New Perception Pipeline
Perception 2.0 retools the traditional pipeline:
- Data Ingestion: Raw video frames.
- Embedding Generation: Use foundation models to create semantic embeddings for the entire dataset.
- Vector Storage: Store embeddings in a vector database.
- Workflow Execution: Perform search, clustering, auto-labeling, and few-shot training as needed.
- Evaluation & Refinement: Validate outputs using consensus techniques and small labeled sets.
This pipeline is faster, more flexible, and scales with data. It shifts the bottleneck from annotation to inference and computation, which are more tractable problems.
Conclusion and Future Directions
The traditional computer vision pipeline, while effective, is inflexible and costly. Perception 2.0, powered by foundation models and embeddings, offers a modular alternative. It enables:
- Rapid Discovery: Find edge cases like cyclists falling or unusual objects on the road using semantic search and clustering.
- Scalable Labeling: Auto-label petabytes of data with consensus techniques, reducing manual effort.
- Flexible Adaptation: Add new perception tasks with few-shot learning, avoiding full model retraining.
As Kyra Mozley demonstrated, this approach is already being used at companies like Wayve to process vast autonomous driving datasets. While foundation models are currently too large to deploy on-vehicle, future work will focus on distillation and quantization to bring these capabilities to the edge.
The key takeaway is that we no longer need to train a new model for every perception task. By embedding once and building flexible workflows on top, we can create vision systems that are more scalable, adaptable, and capable of understanding the complex, messy reality of the open road.