Database migrations represent the highest-risk operation in zero-downtime deployments, capable of locking tables, breaking queries, or silently corrupting data. This article explores the expand-contract pattern as the canonical approach to safe schema evolution in distributed systems, covering detailed strategies for column, index, and table modifications while maintaining backward compatibility.

Zero-Downtime Database Migrations

Introduction

In modern distributed systems, database migrations present one of the most challenging operational hurdles. Unlike monolithic applications where deployments happen in synchronized lockstep, distributed environments often have multiple service instances running different code versions simultaneously. This creates a complex problem: how do we evolve database schemas without causing downtime, data corruption, or service disruption?

Schema changes can have far-reaching consequences. A poorly executed migration might lock entire tables, blocking read and write operations. It might break existing queries that depend on specific column structures or indexes. Worst of all, it might silently corrupt data, creating subtle bugs that surface only after the migration completes, making troubleshooting extremely difficult.

The expand-contract pattern has emerged as the canonical approach to solving these challenges. This pattern breaks down schema evolution into three distinct phases: expand (adding new elements), migrate (transitioning data), and contract (removing old elements). By following this structured approach, teams can safely modify their database schemas while maintaining system availability and data integrity.

The Expand-Contract Pattern

Expand Phase

The expand phase focuses on adding new schema elements without modifying or removing existing ones. This approach ensures backward compatibility, allowing old code to continue functioning unchanged while new code begins utilizing the new elements.

Column Additions

Add new columns as nullable with default values
For NOT NULL columns, first add as nullable, then backfill data, then add constraint
Modern databases like PostgreSQL can add NOT NULL columns with defaults without table rewrites

Table Additions

Create new tables alongside existing ones
No modifications to existing tables or indexes
New tables remain empty until populated during migration

Index Additions

Add indexes in the background where possible
PostgreSQL supports CREATE INDEX CONCURRENTLY for non-blocking index creation
MySQL 8+ supports online DDL for most index operations

During this phase, old code continues to work with the original schema elements. New code can begin writing to both old and new elements to prepare for the eventual switchover. This dual-write approach ensures that when the migration occurs, the new elements contain the necessary data.

Migration Phase

The migration phase transitions from old to new schema elements. This is typically accomplished through data migration scripts that backfill new columns or tables from existing data.

Batch Processing

Migrations should run in batches with progress tracking
Implement pause and resume functionality
Use database-specific online DDL features to minimize locks:
- PostgreSQL: pg_repack
- MySQL: pt-online-schema-change, gh-ost

Minimizing Lock Contention

Process data in batches with sleep intervals between batches
Use transactions carefully to avoid long-running locks
Consider time-based migrations during off-peak hours

Progress Tracking

Implement checkpoints to track migration progress
Record the last processed ID or timestamp to enable resumption
Expose migration status through monitoring dashboards

The migration phase represents the highest-risk period. If the deployment fails during this phase, the migration can be paused or reversed, allowing the team to diagnose and address issues before retrying.

Contract Phase

The contract phase removes old schema elements that are no longer needed. This critical phase only occurs after all running instances have been updated to use the new schema exclusively.

Verification Requirements

Dropping columns: Confirm no query references them
Dropping indexes: Verify query planners no longer use them
Dropping tables: Ensure no foreign keys or application code reference them

Safe Removal Process

Use code audits to confirm references are removed
Monitor system behavior after removal
Implement rollback plans for critical components

The contract phase is non-reversible, which is why it must be executed as a separate, confirmed step. Teams should have thorough testing and monitoring in place before removing any schema elements.

Detailed Migration Strategies

Column Modifications

Adding Columns

Nullable columns with default values are generally safe
NOT NULL columns require special handling:
1. Add column as nullable
2. Backfill data in batches
3. Add NOT NULL constraint
PostgreSQL can add NOT NULL columns with defaults without table rewrites in recent versions

Modifying Column Types

Type changes often require table rewrites
Use online schema change tools to minimize downtime
Consider creating a new column with the correct type, backfilling data, then dropping the old column

Dropping Columns

First verify no code references the column
Use application monitoring to confirm zero usage
Consider marking columns as deprecated in code before physical removal

Index Modifications

Adding Indexes

Use concurrent creation where supported (PostgreSQL)
Monitor system resources during index creation
Verify index usage with EXPLAIN plans

Modifying Indexes

Drop and recreate indexes in separate phases
Consider creating new indexes before dropping old ones
Monitor query performance after index changes

Dropping Indexes

Confirm query patterns no longer depend on the index
Monitor query performance after removal
Consider keeping indexes that are occasionally used

Table Modifications

Renaming Tables

Two-phase approach:
1. Add a view or synonym with the old name
2. Rename the table and update code
Maintain backward compatibility during transition

Splitting Tables

Create new tables
Implement dual-write logic to keep both tables synchronized
Backfill historical data
Switch reads to new tables after backfill completes

Merging Tables

Create the merged table structure
Backfill data from source tables
Update application code to use the merged table
Remove old tables after confirming successful migration

Foreign Key Management

Adding foreign keys:
- Use NOT VALID constraints initially
- Validate constraints as background operations
- Prevents deadlocks during high-traffic periods
Removing foreign keys:
- Verify no dependent data exists
- Check application logic for foreign key references

Rollback Strategies

The expand-contract pattern naturally supports rollbacks at different phases:

Expand Phase Rollbacks

If deployment fails during expand, simply roll back application code
New schema elements (empty columns, tables) are harmless
System continues operating with original schema

Migration Phase Rollbacks

Data backfill can be paused or reversed
Requires tracking migration progress for resumption
May need cleanup of partially populated new elements

Contract Phase Rollbacks

Non-reversible by nature
Should only proceed after thorough verification
Consider maintaining old elements in deprecated state

Orchestration and Automation

Migration Tools

Flyway: Version-based migrations with Java focus
Liquibase: XML-based change tracking with advanced features
Alembic: Python-focused migration tool
Custom solutions for specialized requirements

CI/CD Integration

Test migrations against staging databases
Automate migration deployment as part of release process
Implement pre-flight checks before production migrations

Idempotent Scripts

Migration scripts should be safe to run multiple times
Use conditional logic to check current state
Implement up/down migrations for rollbacks

Monitoring and Alerting

Critical Metrics

Migration progress and completion status
Database CPU and memory usage
Replication lag in distributed systems
Lock contention and wait times
Query latency and error rates

Alerting Strategies

Set up alerts for metric deviations from baseline
Implement automated pause mechanisms for performance issues
Configure notifications for migration completion or failure

Performance Considerations

Schedule migrations during off-peak hours when possible
Implement rate limiting for high-impact operations
Consider read replicas for complex migrations

Trade-offs and Considerations

Complexity vs. Safety

The expand-contract pattern adds operational complexity
This complexity is justified by the safety guarantees it provides
Teams must balance thoroughness with deployment velocity

Resource Utilization

Background migrations consume database resources
Long-running migrations may impact system performance
Teams must balance migration speed with system stability

Testing Challenges

Testing zero-downtime migrations requires sophisticated staging environments
Must simulate production data volumes and traffic patterns
Chaos engineering can help validate migration resilience

Conclusion

Zero-downtime database migrations represent a critical capability for modern distributed systems. The expand-contract pattern provides a structured approach to schema evolution that minimizes risk while enabling continuous deployment. By carefully planning migrations, implementing robust monitoring, and maintaining rollback capabilities, teams can evolve their database schemas without sacrificing system availability or data integrity.

As systems grow in complexity and scale, the importance of safe database migration practices only increases. Organizations that invest in proper migration processes and tools position themselves for more reliable, responsive, and resilient systems in the long term.

Build seamlessly, securely, and flexibly with MongoDB Atlas. Try free.

Zero-Downtime Database Migrations

Zero-Downtime Database Migrations

Introduction

The Expand-Contract Pattern

Expand Phase

Migration Phase

Contract Phase

Detailed Migration Strategies

Column Modifications

Index Modifications

Table Modifications

Rollback Strategies

Orchestration and Automation

Migration Tools

CI/CD Integration

Idempotent Scripts

Monitoring and Alerting

Critical Metrics

Alerting Strategies

Performance Considerations

Trade-offs and Considerations

Complexity vs. Safety

Resource Utilization

Testing Challenges

Conclusion

Comments