Data warehouses like BigQuery excel at analytics but struggle with operational latency and cost at scale. Sarah Usher presents a three-stage data lifecycle (Raw, Curated, Use Case) that separates storage from compute, enabling teams to build ML pipelines and analytics without bottlenecking each other.
[[IMAGE:1]]
Beyond the Warehouse: Why BigQuery Alone Won't Solve Your Data Problems
Speaker: Sarah Usher, software engineer specializing in data engineering and scalable system design
Event: QCon London 2026
Duration: 44:51
The Breaking Point
Many organizations start with a single data warehouse—often BigQuery—as their "one tool to rule them all." Initially, this works well. But as data sources multiply and latency requirements tighten, warehouses hit a breaking point:
- Slowing performance: New data sources cannot be added without query latency creeping upward
- Disorganized data: Confusion about which tables to use for what purpose
- Innovation bottlenecks: Simple operational queries (like SELECT * FROM table) can take minutes, making them unusable for product features
- Rising costs: Scaling warehouses requires adding expensive compute resources
The Real Problem
When a warehouse can't keep up, teams bypass it entirely. Usher shares a common scenario:
- A churn prediction service needs real-time customer account data
- The warehouse is too slow, so the team queries the source service directly
- They build a custom cleaning and ML pipeline, outputting results to S3
- A second team copies this approach for a different ML service
Result: Multiple teams making duplicate API calls, cleaning data differently, and creating inconsistent S3 storage patterns.
Rethinking Data Architecture
Lineage and Source of Truth
Data lineage is the complete path data takes from origin to final use. Source of truth is the authoritative, reliable data source—but Usher argues it shouldn't be a single system for all data.
Instead, she defines source of truth as "where lineage starts to split off." This shift in thinking enables better architectural decisions.
The Three-Stage Data Lifecycle
Usher proposes a conceptual model that diverges from the traditional medallion approach:
- Raw: Unprocessed, immutable data stored as-is (CSV, JSON, etc.)
- Curated: Cleaned, deduplicated, normalized data that still matches the source structure
- Use Case: Highly refined data optimized for specific applications (different formats, schemas, or APIs)
This model allows different implementations per dataset while maintaining conceptual consistency.
The Solution Architecture
Step 1: Store Raw Data
Store raw data in its original format (Usher prefers S3 files) to maintain immutability and enable replayability. This preserves the ability to reprocess data with new architectures or schemas.
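A minimal sketch of this step, assuming boto3 and a bucket named example-raw-data (both hypothetical, not from the talk). The point is that each payload is written exactly as received, under a key that is never reused, so history is immutable and replayable:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def store_raw(source: str, entity: str, payload: dict) -> str:
    """Write one raw record exactly as received; never mutate it afterwards."""
    ingested_at = datetime.now(timezone.utc)
    key = (
        f"raw/{source}/{entity}/"
        f"date={ingested_at:%Y-%m-%d}/{ingested_at:%H%M%S%f}.json"
    )
    s3.put_object(
        Bucket="example-raw-data",  # hypothetical bucket name
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
    )
    return key

# Example: capture a customer-account snapshot from an upstream service.
store_raw("accounts-service", "customer_account", {"id": 42, "status": "active"})
```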
Step 2: Curate Efficiently
Use faster processing systems (Spark, streaming, or specialized tools) to clean and normalize data. Output curated data in formats like Avro to S3 buckets.
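A minimal PySpark sketch of the curation step, assuming the spark-avro package is available and that the bucket and column names are illustrative. The curated output stays structurally close to the source but is deduplicated and normalized once, so every downstream team consumes the same clean copy:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curate-customer-accounts").getOrCreate()

# Read the immutable raw JSON files written by the ingestion step.
raw = spark.read.json("s3a://example-raw-data/raw/accounts-service/customer_account/")

# Clean and normalize, but keep the structure close to the source:
# drop exact duplicates, normalize casing, and parse timestamps once.
curated = (
    raw.dropDuplicates(["id"])
       .withColumn("status", F.lower(F.col("status")))
       .withColumn("updated_at", F.to_timestamp("updated_at"))
)

# Write curated output as Avro so downstream use cases share one clean copy.
curated.write.format("avro").mode("append").save(
    "s3a://example-curated-data/curated/customer_account/"
)
```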
Step 3: Separate Use Cases
Move the warehouse to the use case layer, focusing on analytics rather than being the central data pipeline. This reduces processing overhead and enables better scaling.
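A minimal sketch of the warehouse living at the use-case layer: curated Avro files are loaded into BigQuery purely for analytics. It assumes the curated files have been mirrored to a GCS bucket (BigQuery loads from gs:// URIs); the project, dataset, and bucket names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://example-curated-mirror/curated/customer_account/*.avro",
    "example-project.analytics.customer_account",
    job_config=job_config,
)
load_job.result()  # Wait for the load; analysts query this table, not the pipeline.
```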
Real-World Implementation
For the customer churn example:
- Before: Both ML services bypassed the warehouse, duplicating API calls and data cleaning
- After: Raw data flows through a dedicated pipeline, curated data is shared, and each service has its own use case output
This eliminates duplicate processing, standardizes data cleaning, and creates clear pathways for different use cases.
Cultural and Process Changes
Beyond architecture, organizations need:
- Standard naming conventions: Intuitive S3 prefixes for discoverability (see the sketch after this list)
- Clear data contracts: APIs and schemas that teams can rely on
- Process documentation: How data flows through each stage
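A minimal sketch of what a shared naming convention might look like; the prefix layout and helper below are illustrative assumptions, not conventions prescribed in the talk:

```python
from datetime import date

STAGES = {"raw", "curated", "use_case"}

def s3_prefix(stage: str, domain: str, entity: str, partition: date) -> str:
    """Build a predictable prefix so any team can find a dataset by stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return f"{stage}/{domain}/{entity}/date={partition:%Y-%m-%d}/"

# e.g. "curated/customers/customer_account/date=2026-03-01/"
print(s3_prefix("curated", "customers", "customer_account", date(2026, 3, 1)))
```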
Handling Change and Evolution
Data Changes
Raw data should be immutable: changes create new versions rather than overwriting history (see the versioning sketch after the list below). This enables:
- Historical analysis: See how data evolved over time
- Error recovery: Fix mistakes without losing original data
- Schema evolution: Adapt to new fields without reprocessing everything
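One way to keep that history intact, sketched here as an assumption rather than the talk's prescription, is to enable S3 object versioning so a re-delivered or corrected file becomes a new version instead of overwriting the original (bucket name hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Turn on versioning so no write can destroy an earlier raw object.
s3.put_bucket_versioning(
    Bucket="example-raw-data",
    VersioningConfiguration={"Status": "Enabled"},
)

# Listing versions shows the full history of a key, original delivery included.
versions = s3.list_object_versions(
    Bucket="example-raw-data",
    Prefix="raw/accounts-service/customer_account/",
)
for v in versions.get("Versions", []):
    print(v["Key"], v["VersionId"], v["LastModified"])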
Use Case Changes
When use cases evolve:
- New fields require updates to raw capture, curated processing, and use case outputs
- The three-stage model localizes changes to specific stages
- Teams can rebuild use cases from curated data without touching raw sources
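A minimal PySpark sketch of that last point: rebuilding a use-case output from curated data after a requirement change, without re-reading raw files or the source API. The paths and the derived churn-feature columns are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rebuild-churn-features").getOrCreate()

# Start from the shared curated layer, not the raw files or the source service.
curated = spark.read.format("avro").load(
    "s3a://example-curated-data/curated/customer_account/"
)

# Recompute the use-case output with the newly required field included.
churn_features = curated.select(
    "id",
    "status",
    F.datediff(F.current_date(), F.col("last_login")).alias("days_since_login"),
)

# Each use case owns its own output location and format.
churn_features.write.mode("overwrite").parquet(
    "s3a://example-use-case-data/use_case/churn_prediction/features/"
)
```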
Q&A Highlights
Q: How do you identify the true source of truth when data splits multiple times?
A: Look for where the actual split occurs in practice, not where teams claim it is. The true source is often upstream from the warehouse.

Q: Should curated data be pushed or pulled to use cases?
A: Either approach works. The key is making curated data available through APIs, streams, or files that use cases can consume.

Q: What about pushing processed data back as a new source?
A: Treat calculated metrics as new entities and restart the lifecycle. They become raw data for the next stage.

Q: How do you handle data mistakes?
A: Keep the original raw data and fix issues at the curated stage. This preserves history while enabling corrections.
Key Takeaways
- Warehouses aren't everything: They excel at analytics but struggle with operational latency
- Separate storage from compute: Store raw data immutably, curate efficiently, and optimize for use cases
- Design your lineage: Control where source of truth lives rather than accepting where it lands
- Three stages over two: The curated stage prevents duplicate processing and enables evolution
- Store your raw data: This single practice enables architectural flexibility and error recovery
"If you remember only one thing from this talk, it is to please store your raw data." - Sarah Usher
[[IMAGE:2]]
The three-stage data lifecycle provides a conceptual framework that works across different technologies and scales, enabling organizations to move beyond the limitations of single-system data warehouses while maintaining data quality and accessibility.
