Zero-Downtime Database Migrations
#Regulation

Zero-Downtime Database Migrations

Backend Reporter
7 min read

Database migrations represent the highest-risk operation in zero-downtime deployments, capable of locking tables, breaking queries, or silently corrupting data. This article explores the expand-contract pattern as the canonical approach to safe schema evolution in distributed systems, covering detailed strategies for column, index, and table modifications while maintaining backward compatibility.

Zero-Downtime Database Migrations

Introduction

In modern distributed systems, database migrations present one of the most challenging operational hurdles. Unlike monolithic applications where deployments happen in synchronized lockstep, distributed environments often have multiple service instances running different code versions simultaneously. This creates a complex problem: how do we evolve database schemas without causing downtime, data corruption, or service disruption?

Schema changes can have far-reaching consequences. A poorly executed migration might lock entire tables, blocking read and write operations. It might break existing queries that depend on specific column structures or indexes. Worst of all, it might silently corrupt data, creating subtle bugs that surface only after the migration completes, making troubleshooting extremely difficult.

The expand-contract pattern has emerged as the canonical approach to solving these challenges. This pattern breaks down schema evolution into three distinct phases: expand (adding new elements), migrate (transitioning data), and contract (removing old elements). By following this structured approach, teams can safely modify their database schemas while maintaining system availability and data integrity.

The Expand-Contract Pattern

Expand Phase

The expand phase focuses on adding new schema elements without modifying or removing existing ones. This approach ensures backward compatibility, allowing old code to continue functioning unchanged while new code begins utilizing the new elements.

Column Additions

  • Add new columns as nullable with default values
  • For NOT NULL columns, first add as nullable, then backfill data, then add constraint
  • Modern databases like PostgreSQL can add NOT NULL columns with defaults without table rewrites

Table Additions

  • Create new tables alongside existing ones
  • No modifications to existing tables or indexes
  • New tables remain empty until populated during migration

Index Additions

  • Add indexes in the background where possible
  • PostgreSQL supports CREATE INDEX CONCURRENTLY for non-blocking index creation
  • MySQL 8+ supports online DDL for most index operations

During this phase, old code continues to work with the original schema elements. New code can begin writing to both old and new elements to prepare for the eventual switchover. This dual-write approach ensures that when the migration occurs, the new elements contain the necessary data.

Migration Phase

The migration phase transitions from old to new schema elements. This is typically accomplished through data migration scripts that backfill new columns or tables from existing data.

Batch Processing

  • Migrations should run in batches with progress tracking
  • Implement pause and resume functionality
  • Use database-specific online DDL features to minimize locks:

Minimizing Lock Contention

  • Process data in batches with sleep intervals between batches
  • Use transactions carefully to avoid long-running locks
  • Consider time-based migrations during off-peak hours

Progress Tracking

  • Implement checkpoints to track migration progress
  • Record the last processed ID or timestamp to enable resumption
  • Expose migration status through monitoring dashboards

The migration phase represents the highest-risk period. If the deployment fails during this phase, the migration can be paused or reversed, allowing the team to diagnose and address issues before retrying.

Contract Phase

The contract phase removes old schema elements that are no longer needed. This critical phase only occurs after all running instances have been updated to use the new schema exclusively.

Verification Requirements

  • Dropping columns: Confirm no query references them
  • Dropping indexes: Verify query planners no longer use them
  • Dropping tables: Ensure no foreign keys or application code reference them

Safe Removal Process

  • Use code audits to confirm references are removed
  • Monitor system behavior after removal
  • Implement rollback plans for critical components

The contract phase is non-reversible, which is why it must be executed as a separate, confirmed step. Teams should have thorough testing and monitoring in place before removing any schema elements.

Detailed Migration Strategies

Column Modifications

Adding Columns

  • Nullable columns with default values are generally safe
  • NOT NULL columns require special handling:
    1. Add column as nullable
    2. Backfill data in batches
    3. Add NOT NULL constraint
  • PostgreSQL can add NOT NULL columns with defaults without table rewrites in recent versions

Modifying Column Types

  • Type changes often require table rewrites
  • Use online schema change tools to minimize downtime
  • Consider creating a new column with the correct type, backfilling data, then dropping the old column

Dropping Columns

  • First verify no code references the column
  • Use application monitoring to confirm zero usage
  • Consider marking columns as deprecated in code before physical removal

Index Modifications

Adding Indexes

  • Use concurrent creation where supported (PostgreSQL)
  • Monitor system resources during index creation
  • Verify index usage with EXPLAIN plans

Modifying Indexes

  • Drop and recreate indexes in separate phases
  • Consider creating new indexes before dropping old ones
  • Monitor query performance after index changes

Dropping Indexes

  • Confirm query patterns no longer depend on the index
  • Monitor query performance after removal
  • Consider keeping indexes that are occasionally used

Table Modifications

Renaming Tables

  • Two-phase approach:
    1. Add a view or synonym with the old name
    2. Rename the table and update code
  • Maintain backward compatibility during transition

Splitting Tables

  • Create new tables
  • Implement dual-write logic to keep both tables synchronized
  • Backfill historical data
  • Switch reads to new tables after backfill completes

Merging Tables

  • Create the merged table structure
  • Backfill data from source tables
  • Update application code to use the merged table
  • Remove old tables after confirming successful migration

Foreign Key Management

  • Adding foreign keys:
    • Use NOT VALID constraints initially
    • Validate constraints as background operations
    • Prevents deadlocks during high-traffic periods
  • Removing foreign keys:
    • Verify no dependent data exists
    • Check application logic for foreign key references

Rollback Strategies

The expand-contract pattern naturally supports rollbacks at different phases:

Expand Phase Rollbacks

  • If deployment fails during expand, simply roll back application code
  • New schema elements (empty columns, tables) are harmless
  • System continues operating with original schema

Migration Phase Rollbacks

  • Data backfill can be paused or reversed
  • Requires tracking migration progress for resumption
  • May need cleanup of partially populated new elements

Contract Phase Rollbacks

  • Non-reversible by nature
  • Should only proceed after thorough verification
  • Consider maintaining old elements in deprecated state

Orchestration and Automation

Migration Tools

  • Flyway: Version-based migrations with Java focus
  • Liquibase: XML-based change tracking with advanced features
  • Alembic: Python-focused migration tool
  • Custom solutions for specialized requirements

CI/CD Integration

  • Test migrations against staging databases
  • Automate migration deployment as part of release process
  • Implement pre-flight checks before production migrations

Idempotent Scripts

  • Migration scripts should be safe to run multiple times
  • Use conditional logic to check current state
  • Implement up/down migrations for rollbacks

Monitoring and Alerting

Critical Metrics

  • Migration progress and completion status
  • Database CPU and memory usage
  • Replication lag in distributed systems
  • Lock contention and wait times
  • Query latency and error rates

Alerting Strategies

  • Set up alerts for metric deviations from baseline
  • Implement automated pause mechanisms for performance issues
  • Configure notifications for migration completion or failure

Performance Considerations

  • Schedule migrations during off-peak hours when possible
  • Implement rate limiting for high-impact operations
  • Consider read replicas for complex migrations

Trade-offs and Considerations

Complexity vs. Safety

  • The expand-contract pattern adds operational complexity
  • This complexity is justified by the safety guarantees it provides
  • Teams must balance thoroughness with deployment velocity

Resource Utilization

  • Background migrations consume database resources
  • Long-running migrations may impact system performance
  • Teams must balance migration speed with system stability

Testing Challenges

  • Testing zero-downtime migrations requires sophisticated staging environments
  • Must simulate production data volumes and traffic patterns
  • Chaos engineering can help validate migration resilience

Conclusion

Zero-downtime database migrations represent a critical capability for modern distributed systems. The expand-contract pattern provides a structured approach to schema evolution that minimizes risk while enabling continuous deployment. By carefully planning migrations, implementing robust monitoring, and maintaining rollback capabilities, teams can evolve their database schemas without sacrificing system availability or data integrity.

As systems grow in complexity and scale, the importance of safe database migration practices only increases. Organizations that invest in proper migration processes and tools position themselves for more reliable, responsive, and resilient systems in the long term.

Build seamlessly, securely, and flexibly with MongoDB Atlas. Try free.

Comments

Loading comments...