Data warehouses like BigQuery excel at analytics but struggle with operational latency and cost at scale. Sarah Usher presents a three-stage data lifecycle (Raw, Curated, Use Case) that separates storage from compute, enabling teams to build ML pipelines and analytics without bottlenecking each other.
[[IMAGE:1]]
Beyond the Warehouse: Why BigQuery Alone Won't Solve Your Data Problems
Speaker: Sarah Usher, software engineer specializing in data engineering and scalable system design
Event: QCon London 2026
Duration: 44:51
The Breaking Point
Many organizations start with a single data warehouse—often BigQuery—as their "one tool to rule them all." Initially, this works well. But as data sources multiply and latency requirements tighten, warehouses hit a breaking point:
- Slowing performance: New data sources cannot be added without query latency creeping upward
- Disorganized data: Confusion about which tables to use for what purpose
- Innovation bottlenecks: Simple operational queries (like SELECT * FROM table) can take minutes, making them unusable for product features
- Rising costs: Scaling warehouses requires adding expensive compute resources
The Real Problem
When a warehouse can't keep up, teams bypass it entirely. Usher shares a common scenario:
- A churn prediction service needs real-time customer account data
- The warehouse is too slow, so the team queries the source service directly
- They build a custom cleaning and ML pipeline, outputting results to S3
- A second team copies this approach for a different ML service
Result: Multiple teams making duplicate API calls, cleaning data differently, and creating inconsistent S3 storage patterns.
Rethinking Data Architecture
Lineage and Source of Truth
Data lineage is the complete path data takes from origin to final use. Source of truth is the authoritative, reliable data source—but Usher argues it shouldn't be a single system for all data.
Instead, she defines source of truth as "where lineage starts to split off." This shift in thinking enables better architectural decisions.
The Three-Stage Data Lifecycle
Usher proposes a conceptual model that diverges from the traditional medallion approach:
- Raw: Unprocessed, immutable data stored as-is (CSV, JSON, etc.)
- Curated: Cleaned, deduplicated, normalized data that still matches the source structure
- Use Case: Highly refined data optimized for specific applications (different formats, schemas, or APIs)
This model allows different implementations per dataset while maintaining conceptual consistency.
The Solution Architecture
Step 1: Store Raw Data
Store raw data in its original format (Usher prefers S3 files) to maintain immutability and enable replayability. This preserves the ability to reprocess data with new architectures or schemas.
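A minimal sketch of this step, assuming boto3 and a bucket named example-raw-data (both hypothetical, not from the talk). The point is that each payload is written exactly as received, under a key that is never reused, so history is immutable and replayable:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def store_raw(source: str, entity: str, payload: dict) -> str:
    """Write one raw record exactly as received; never mutate it afterwards."""
    ingested_at = datetime.now(timezone.utc)
    key = (
        f"raw/{source}/{entity}/"
        f"date={ingested_at:%Y-%m-%d}/{ingested_at:%H%M%S%f}.json"
    )
    s3.put_object(
        Bucket="example-raw-data",  # hypothetical bucket name
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
    )
    return key

# Example: capture a customer-account snapshot from an upstream service.
store_raw("accounts-service", "customer_account", {"id": 42, "status": "active"})
```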
Step 2: Curate Efficiently
Use faster processing systems (Spark, streaming, or specialized tools) to clean and normalize data. Output curated data in formats like Avro to S3 buckets.
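A minimal PySpark sketch of the curation step, assuming the spark-avro package is available and that the bucket and column names are illustrative. The curated output stays structurally close to the source but is deduplicated and normalized once, so every downstream team consumes the same clean copy:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curate-customer-accounts").getOrCreate()

# Read the immutable raw JSON files written by the ingestion step.
raw = spark.read.json("s3a://example-raw-data/raw/accounts-service/customer_account/")

# Clean and normalize, but keep the structure close to the source:
# drop exact duplicates, normalize casing, and parse timestamps once.
curated = (
    raw.dropDuplicates(["id"])
       .withColumn("status", F.lower(F.col("status")))
       .withColumn("updated_at", F.to_timestamp("updated_at"))
)

# Write curated output as Avro so downstream use cases share one clean copy.
curated.write.format("avro").mode("append").save(
    "s3a://example-curated-data/curated/customer_account/"
)
```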
Step 3: Separate Use Cases
Move the warehouse to the use case layer, focusing on analytics rather than being the central data pipeline. This reduces processing overhead and enables better scaling.
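A minimal sketch of the warehouse living at the use-case layer: curated Avro files are loaded into BigQuery purely for analytics. It assumes the curated files have been mirrored to a GCS bucket (BigQuery loads from gs:// URIs); the project, dataset, and bucket names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://example-curated-mirror/curated/customer_account/*.avro",
    "example-project.analytics.customer_account",
    job_config=job_config,
)
load_job.result()  # Wait for the load; analysts query this table, not the pipeline.
```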
Real-World Implementation
For the customer churn example:
- Before: Both ML services bypassed the warehouse, duplicating API calls and data cleaning
- After: Raw data flows through a dedicated pipeline, curated data is shared, and each service has its own use case output
This eliminates duplicate processing, standardizes data cleaning, and creates clear pathways for different use cases.
Cultural and Process Changes
Beyond architecture, organizations need:
- Standard naming conventions: Intuitive S3 prefixes for discoverability (see the sketch after this list)
- Clear data contracts: APIs and schemas that teams can rely on
- Process documentation: How data flows through each stage
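A minimal sketch of what a shared naming convention might look like; the prefix layout and helper below are illustrative assumptions, not conventions prescribed in the talk:

```python
from datetime import date

STAGES = {"raw", "curated", "use_case"}

def s3_prefix(stage: str, domain: str, entity: str, partition: date) -> str:
    """Build a predictable prefix so any team can find a dataset by stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return f"{stage}/{domain}/{entity}/date={partition:%Y-%m-%d}/"

# e.g. "curated/customers/customer_account/date=2026-03-01/"
print(s3_prefix("curated", "customers", "customer_account", date(2026, 3, 1)))
```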
Handling Change and Evolution
Data Changes
Raw data should be immutable: changes create new versions rather than overwriting history (see the versioning sketch after the list below). This enables:
- Historical analysis: See how data evolved over time
- Error recovery: Fix mistakes without losing original data
- Schema evolution: Adapt to new fields without reprocessing everything
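One way to keep that history intact, sketched here as an assumption rather than the talk's prescription, is to enable S3 object versioning so a re-delivered or corrected file becomes a new version instead of overwriting the original (bucket name hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Turn on versioning so no write can destroy an earlier raw object.
s3.put_bucket_versioning(
    Bucket="example-raw-data",
    VersioningConfiguration={"Status": "Enabled"},
)

# Listing versions shows the full history of a key, original delivery included.
versions = s3.list_object_versions(
    Bucket="example-raw-data",
    Prefix="raw/accounts-service/customer_account/",
)
for v in versions.get("Versions", []):
    print(v["Key"], v["VersionId"], v["LastModified"])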
Use Case Changes
When use cases evolve:
- New fields require updates to raw capture, curated processing, and use case outputs
- The three-stage model localizes changes to specific stages
- Teams can rebuild use cases from curated data without touching raw sources
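A minimal PySpark sketch of that last point: rebuilding a use-case output from curated data after a requirement change, without re-reading raw files or the source API. The paths and the derived churn-feature columns are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rebuild-churn-features").getOrCreate()

# Start from the shared curated layer, not the raw files or the source service.
curated = spark.read.format("avro").load(
    "s3a://example-curated-data/curated/customer_account/"
)

# Recompute the use-case output with the newly required field included.
churn_features = curated.select(
    "id",
    "status",
    F.datediff(F.current_date(), F.col("last_login")).alias("days_since_login"),
)

# Each use case owns its own output location and format.
churn_features.write.mode("overwrite").parquet(
    "s3a://example-use-case-data/use_case/churn_prediction/features/"
)
```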
Q&A Highlights
Q: How do you identify the true source of truth when data splits multiple times?
A: Look for where the actual split occurs in practice, not where teams claim it is. The true source is often upstream from the warehouse.

Q: Should curated data be pushed or pulled to use cases?
A: Either approach works. The key is making curated data available through APIs, streams, or files that use cases can consume.

Q: What about pushing processed data back as a new source?
A: Treat calculated metrics as new entities and restart the lifecycle. They become raw data for the next stage.

Q: How do you handle data mistakes?
A: Keep the original raw data and fix issues at the curated stage. This preserves history while enabling corrections.
Key Takeaways
- Warehouses aren't everything: They excel at analytics but struggle with operational latency
- Separate storage from compute: Store raw data immutably, curate efficiently, and optimize for use cases
- Design your lineage: Control where source of truth lives rather than accepting where it lands
- Three stages over two: The curated stage prevents duplicate processing and enables evolution
- Store your raw data: This single practice enables architectural flexibility and error recovery
"If you remember only one thing from this talk, it is to please store your raw data." - Sarah Usher
[[IMAGE:2]]
The three-stage data lifecycle provides a conceptual framework that works across different technologies and scales, enabling organizations to move beyond the limitations of single-system data warehouses while maintaining data quality and accessibility.
