Netflix Automates RDS PostgreSQL to Aurora PostgreSQL Migration Across 400 Production Clusters
#Infrastructure

Backend Reporter

Netflix has built an internal automation platform that migrates Amazon RDS PostgreSQL databases to Aurora PostgreSQL across nearly 400 production clusters, reducing operational risk and downtime through self-service workflows, replication validation, and rollback safeguards.

Netflix has developed an internal automation platform that migrates Amazon RDS for PostgreSQL databases to Amazon Aurora PostgreSQL, reducing operational risk and downtime across nearly 400 production clusters. The system enables service teams to initiate migrations through a self-service workflow while enforcing replication validation, controlled cutover, change data capture coordination, and rollback safeguards.

Netflix routes database access through a platform-managed data access layer built on Envoy, which standardizes mutual TLS and abstracts database endpoints from application code. Because services do not directly manage credentials or connection strings, migrations must occur transparently beneath this layer. The automation, therefore, coordinates replication, validation, cutover, CDC handling, and rollback entirely at the infrastructure level.

Netflix engineers emphasized: "Our goal was to make RDS to Aurora migrations repeatable and low-touch, while preserving correctness guarantees for both transactional workloads and CDC pipelines."

The workflow begins by creating an Aurora PostgreSQL cluster as a physical read replica of the source RDS PostgreSQL instance using capabilities provided by Amazon Web Services. The replica is initialized from a storage snapshot and continuously replays write-ahead log records streamed from the source. During this phase, the system validates replication slot health, WAL generation rates, parameter compatibility, extension parity, and sustained replication lag under production traffic, ensuring the replica can sustain peak write throughput before cutover.
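The pre-cutover checks described above can be pictured as a single gate that collects reasons a replica is not yet ready. The following is a minimal sketch, not Netflix's implementation: the `ReplicaSnapshot` type, its fields, and the 5-second lag threshold are all hypothetical, and a real system would populate them from database queries and monitoring metrics.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReplicaSnapshot:
    """Hypothetical point-in-time view of the source and its Aurora replica."""
    source_extensions: dict[str, str]    # extension name -> version on the RDS source
    replica_extensions: dict[str, str]   # extension name -> version on the Aurora replica
    lag_samples_seconds: list[float]     # replication lag sampled under production traffic
    inactive_slots: list[str] = field(default_factory=list)  # slots with no consumer


def preflight_failures(snap: ReplicaSnapshot, max_lag_seconds: float = 5.0) -> list[str]:
    """Return the reasons the replica is not ready; an empty list means it passed."""
    failures = []
    # Extension parity: every source extension must exist at the same version.
    for name, version in sorted(snap.source_extensions.items()):
        replica_version = snap.replica_extensions.get(name)
        if replica_version is None:
            failures.append(f"extension {name} missing on replica")
        elif replica_version != version:
            failures.append(f"extension {name}: replica {replica_version}, source {version}")
    # Sustained lag: judge by the worst sample, not the average.
    if not snap.lag_samples_seconds:
        failures.append("no replication lag samples collected")
    elif max(snap.lag_samples_seconds) > max_lag_seconds:
        worst = max(snap.lag_samples_seconds)
        failures.append(f"peak lag {worst:.1f}s exceeds {max_lag_seconds:.1f}s")
    # Slot health: inactive slots retain WAL on the source.
    failures.extend(f"inactive replication slot: {slot}" for slot in snap.inactive_slots)
    return failures
```

Returning every failure rather than stopping at the first lets an operator see all blockers in one pass.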

For workloads using change data capture, including logical replication slots or downstream stream processors, the automation coordinates slot state before quiescence. CDC consumers are paused to prevent excessive WAL retention, and slot positions are recorded so that equivalent replication slots can be recreated on Aurora at the correct log sequence number after promotion. This preserves downstream consistency while avoiding WAL buildup that could increase replication lag.
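Recording slot positions amounts to capturing each logical slot's log sequence number before quiescence. The sketch below assumes rows shaped like PostgreSQL's `pg_replication_slots` view (`slot_name`, `plugin`, `slot_type`, `confirmed_flush_lsn`); the `SlotCheckpoint` type and function names are hypothetical. The article does not describe how slots are recreated on Aurora, so this covers only the recording step.

```python
from dataclasses import dataclass


def parse_lsn(text: str) -> int:
    """Convert PostgreSQL's 'hi/lo' hexadecimal LSN text into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def format_lsn(value: int) -> str:
    """Render a 64-bit LSN back into PostgreSQL's 'hi/lo' text form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


@dataclass(frozen=True)
class SlotCheckpoint:
    """Hypothetical record of where a CDC consumer had confirmed up to."""
    slot_name: str
    plugin: str
    confirmed_flush_lsn: int


def record_logical_slots(rows: list[dict]) -> list[SlotCheckpoint]:
    """Keep only logical slots; physical slots have no plugin or confirmed_flush_lsn."""
    return [
        SlotCheckpoint(r["slot_name"], r["plugin"], parse_lsn(r["confirmed_flush_lsn"]))
        for r in rows
        if r["slot_type"] == "logical"
    ]
```

Storing LSNs as integers makes it straightforward to compare positions or compute how many bytes of WAL separate two points.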

An early adopter, Netflix's Enablement Applications team, migrated databases supporting device certification and partner billing workflows. During replication, engineers detected an elevated OldestReplicationSlotLag caused by an inactive logical replication slot retaining WAL segments and increasing replication lag. After removing the stale slot, replication converged, and migration completed successfully with post-cutover metrics matching pre-migration baselines.
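A stale slot of the kind described here can be found by looking for inactive slots that hold back a large amount of WAL. The sketch below is illustrative only: the query uses the standard `pg_replication_slots` view and `pg_wal_lsn_diff` function, while the `stale_slots` helper and its 1 GiB threshold are assumptions, not values from Netflix.

```python
# Query a caller would run on the source; shown for context, not executed here.
STALE_SLOT_QUERY = """
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""


def stale_slots(rows: list[dict], min_retained_bytes: int = 1 << 30) -> list[str]:
    """Return inactive slots retaining at least the threshold, largest first."""
    candidates = [
        (row["retained_bytes"], row["slot_name"])
        for row in rows
        if not row["active"]
        and row["retained_bytes"] is not None   # restart_lsn can be NULL
        and row["retained_bytes"] >= min_retained_bytes
    ]
    return [name for _, name in sorted(candidates, reverse=True)]
```

Whether a flagged slot is safe to drop remains a human or policy decision, since an inactive slot may belong to a consumer that is only temporarily offline.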

When replication lag approaches zero, the system enters a controlled quiescence phase. Security group rules are modified, and the source RDS instance is rebooted to block new connections at the infrastructure layer. After confirming that all in-flight transactions have been applied and that the Aurora replica has replayed the final WAL records, the replica is promoted to a writable Aurora cluster, and the data access layer routes traffic to the new endpoint.
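The cutover sequence reads naturally as a small state machine in which each phase advances only when its condition holds. The following is a simplified sketch under assumed names: the `Phase` values, the `Observation` fields, and the 1 MiB lag threshold are hypothetical, and the real workflow also performs the security group change and reboot between phases.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    REPLICATING = "replicating"
    QUIESCING = "quiescing"
    PROMOTING = "promoting"
    ROUTED = "routed"


@dataclass(frozen=True)
class Observation:
    """Hypothetical measurements gathered on each control-loop tick."""
    lag_bytes: int            # WAL the replica has yet to replay
    source_connections: int   # client connections still open on the RDS source
    source_final_lsn: int     # last WAL position written by the source
    replica_replay_lsn: int   # last WAL position replayed by the replica
    replica_writable: bool    # True once promotion has completed


def next_phase(phase: Phase, obs: Observation, lag_threshold_bytes: int = 1 << 20) -> Phase:
    """Advance one step if the current phase's exit condition is met."""
    if phase is Phase.REPLICATING:
        return Phase.QUIESCING if obs.lag_bytes <= lag_threshold_bytes else phase
    if phase is Phase.QUIESCING:
        drained = obs.source_connections == 0
        caught_up = obs.replica_replay_lsn >= obs.source_final_lsn
        return Phase.PROMOTING if drained and caught_up else phase
    if phase is Phase.PROMOTING:
        return Phase.ROUTED if obs.replica_writable else phase
    return phase
```

Because the function never skips a phase, promotion cannot happen until the source is drained and the replica has replayed the final WAL position.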

According to Netflix engineers, rollback was treated as a first-class concern. Until promotion is finalized and traffic is fully shifted, the original RDS instance remains intact as the authoritative source. If validation checks fail during synchronization or if post-promotion health checks detect anomalies, traffic can be redirected back to the RDS cluster through the data access layer. Because applications are decoupled from physical endpoints, reverting the routing configuration restores the prior state without redeployment. CDC consumers can also resume from previously recorded slot positions on the original cluster if required.
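Rollback by routing change can be illustrated with a table that maps a logical database name to a physical endpoint and remembers the previous target. This is a hypothetical sketch of the idea, not the data access layer's actual interface; the `RoutingTable` class and the endpoint names in the usage note are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    active: str                      # endpoint currently receiving traffic
    previous: Optional[str] = None   # endpoint to return to on rollback


class RoutingTable:
    """Hypothetical stand-in for the routing state held by a data access layer."""

    def __init__(self) -> None:
        self._routes: dict[str, Route] = {}

    def register(self, logical_name: str, endpoint: str) -> None:
        self._routes[logical_name] = Route(active=endpoint)

    def resolve(self, logical_name: str) -> str:
        return self._routes[logical_name].active

    def cut_over(self, logical_name: str, new_endpoint: str) -> None:
        """Point traffic at the new endpoint while remembering the old one."""
        route = self._routes[logical_name]
        route.previous, route.active = route.active, new_endpoint

    def roll_back(self, logical_name: str) -> None:
        """Restore the prior endpoint; fails loudly if there is nothing to revert."""
        route = self._routes[logical_name]
        if route.previous is None:
            raise RuntimeError(f"no previous endpoint recorded for {logical_name}")
        route.active, route.previous = route.previous, None
```

For example, after `register("billing", "billing-rds.example.internal")` and `cut_over("billing", "billing-aurora.example.internal")`, a call to `roll_back("billing")` makes `resolve("billing")` return the RDS endpoint again, with no change to application code.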

The migration automation addresses several critical challenges in large-scale database operations. First, it eliminates the need for application teams to understand or manage database endpoint changes, reducing the operational burden and potential for configuration errors. Second, by coordinating CDC state management, it ensures that downstream systems consuming change streams remain consistent throughout the migration process. Third, the rollback capability provides a safety net that allows teams to proceed with confidence, knowing they can revert to the original state if unexpected issues arise.

This approach reflects Netflix's broader strategy of building platform capabilities that abstract infrastructure complexity away from application teams. By standardizing database access through the Envoy-based data access layer and providing automated migration workflows, Netflix enables teams to focus on their core business logic rather than infrastructure management details.

The scale of this deployment, nearly 400 production clusters, demonstrates the maturity of Netflix's internal tooling and the importance of database migration automation in large-scale cloud operations. As organizations continue to adopt cloud-native architectures and seek to optimize their database infrastructure, solutions like this provide valuable patterns for managing the complexity of large-scale migrations while maintaining service reliability and data consistency.
