Zero-Downtime PostgreSQL Upgrade: Mastering Logical Replication for High-Load Clusters

In the fast-paced world of high-load applications, maintaining database availability during upgrades is critical. A recent project involving a PostgreSQL cluster handling over 20,000 transactions per second required an upgrade from version 13 to 16 with minimal downtime. This process not only addressed performance bottlenecks from aging servers but also prepared the system for PostgreSQL 13's impending end-of-life, ensuring continued access to security patches and new features.

The cluster, comprising a master and a replica, faced growing data volumes that strained existing hardware. Upgrading to more powerful machines with larger disk volumes was essential, but the real challenge lay in the database migration itself. The team opted for a hybrid replication strategy to achieve near-zero downtime, blending physical and logical replication techniques.

The Replication Strategy: From Physical to Logical

The upgrade process unfolded in two main stages: creating a new replica running the target version, then promoting it to master. Initial attempts at logical replication alone proved too slow for the 3.5 TB database: at a synchronization rate of about 1 GB per minute, a full sync would take roughly two and a half days, during which WAL retained by the replication slot could fill the master's disk and bring it down.
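
A quick way to keep that risk visible is to check how much WAL each replication slot is retaining on the master, for example:

SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;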

To mitigate this, the team first established a physical replica, which synchronized at a brisk 10 GB per minute. Once synced, they promoted it to standalone mode, captured the Log Sequence Number (LSN) from the logs (e.g., 'redo done at 0/3402801'), and advanced a newly created logical replication slot to that point using commands like:

SELECT pg_create_logical_replication_slot('logical_replica_slot', 'pgoutput');
SELECT pg_replication_slot_advance('logical_replica_slot', '0/3402801');
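
Before wiring up the subscription, the slot's position can be sanity-checked against the captured LSN:

SELECT slot_name, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'logical_replica_slot';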

This ensured the logical replica picked up exactly where the physical one left off, avoiding data gaps. On the master, a publication was created for all tables:

CREATE PUBLICATION our_db_pub FOR ALL TABLES;
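
The set of tables covered by the publication can be confirmed with:

SELECT pubname, schemaname, tablename FROM pg_publication_tables WHERE pubname = 'our_db_pub';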

A subscription was then set up on the replica without copying data (since it was already synced):

CREATE SUBSCRIPTION our_db_sub CONNECTION 'host=a.b.c.d dbname=our_db' PUBLICATION our_db_pub WITH (copy_data=false, slot_name='logical_replica_slot', create_slot=false);

Navigating Upgrade Challenges with pg_upgrade

With replication in place, the focus shifted to upgrading the replica to PostgreSQL 16. Installation on Ubuntu involved:

apt install postgresql-16 postgresql-16-repack

The pg_upgrade tool handled the transition; it was first run with --check to verify compatibility and then executed with hard links so data files did not need to be copied:

/usr/lib/postgresql/16/bin/pg_upgrade \
  --old-datadir=/var/lib/postgresql/13/main \
  --new-datadir=/var/lib/postgresql/16/main \
  --old-bindir=/usr/lib/postgresql/13/bin \
  --new-bindir=/usr/lib/postgresql/16/bin \
  --old-options '-c config_file=/etc/postgresql/13/main/postgresql.conf' \
  --new-options '-c config_file=/etc/postgresql/16/main/postgresql.conf' \
  --link
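
The compatibility pass mentioned above is the same invocation with --check appended, a dry run that reports problems without modifying either cluster:

/usr/lib/postgresql/16/bin/pg_upgrade \
  --old-datadir=/var/lib/postgresql/13/main \
  --new-datadir=/var/lib/postgresql/16/main \
  --old-bindir=/usr/lib/postgresql/13/bin \
  --new-bindir=/usr/lib/postgresql/16/bin \
  --old-options '-c config_file=/etc/postgresql/13/main/postgresql.conf' \
  --new-options '-c config_file=/etc/postgresql/16/main/postgresql.conf' \
  --check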

Pre-upgrade preparations included dropping extensions known to cause migration errors, such as pg_stat_statements, along with their schemas. Post-upgrade, the configuration was adjusted for port assignments and deprecated options, followed by vacuuming and analyzing to refresh planner statistics:

/usr/lib/postgresql/16/bin/vacuumdb --all --analyze-in-stages
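
For the extension cleanup mentioned earlier, the pattern is roughly to drop before the upgrade and recreate afterwards if the extension is still wanted (the exact list varies per cluster):

DROP EXTENSION IF EXISTS pg_stat_statements;  -- on the old cluster, before pg_upgrade
CREATE EXTENSION pg_stat_statements;          -- on the upgraded cluster, afterwards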

Challenges arose when subscriptions failed after the upgrade, most likely because of a version mismatch in the pgoutput output plugin. The fix was to disable and recreate the subscriptions without dropping the replication slot on the master (a sketch follows below), then verify replication status with queries like:

SELECT * FROM pg_stat_subscription;
SELECT usename, client_addr, state, replay_lag FROM pg_stat_replication;
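
The disable-and-recreate sequence itself, with the master-side slot preserved, looks roughly like this using the names from earlier:

ALTER SUBSCRIPTION our_db_sub DISABLE;
ALTER SUBSCRIPTION our_db_sub SET (slot_name = NONE);  -- dissociate the slot so dropping the subscription leaves it on the master
DROP SUBSCRIPTION our_db_sub;
CREATE SUBSCRIPTION our_db_sub CONNECTION 'host=a.b.c.d dbname=our_db' PUBLICATION our_db_pub WITH (copy_data=false, slot_name='logical_replica_slot', create_slot=false);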

Ultimately, pairing the upgrade with a fresh logical replication setup resolved the synchronization issues, and matching record counts between master and replica confirmed the data was consistent.

Seamless Master Switchover and Final Touches

To complete the upgrade, the team warmed the new replica's cache by running representative queries, set the old master to read-only with ALTER SYSTEM SET default_transaction_read_only TO on, and confirmed zero replay lag. Load balancers were then reconfigured to direct traffic to the new master, resulting in just 10-15 seconds of write downtime for checkpointing and switching.
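
A minimal sketch of that freeze-and-verify step, assuming the setting takes effect through a configuration reload:

ALTER SYSTEM SET default_transaction_read_only TO on;
SELECT pg_reload_conf();
SELECT replay_lag FROM pg_stat_replication;  -- wait for this to reach zero before switching traffic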

Post-switchover, the old master was upgraded similarly and repurposed as a streaming replica. A rollback plan was in place, involving reverting load balancers and re-enabling subscriptions. One overlooked detail was updating sequences for autoincrementing keys:

ALTER SEQUENCE id_seq RESTART WITH 1822319;

Logical replication does not carry sequence state across, so sequences on the new master lagged behind the data they serve; restarting them above the highest existing key prevented duplicate-key insertion failures in tables with non-UUID primary keys.
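
A generic way to derive that restart value, using a hypothetical table our_table whose id column is fed by id_seq:

SELECT setval('id_seq', (SELECT max(id) FROM our_table));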

Implications for High-Availability Database Management

This upgrade exemplifies how strategic replication can transform a routine maintenance task into a resilient evolution of infrastructure. By leveraging PostgreSQL's built-in tools, teams can handle massive workloads without sacrificing availability, unlocking version 16's performance optimizations and features like improved parallelism and security enhancements. For developers and DBAs managing similar clusters, this approach offers a blueprint for future-proofing databases amid relentless data growth, ensuring systems not only survive but thrive under pressure.