Scaling PostgreSQL to Power 800 Million ChatGPT Users: Engineering Challenges and Solutions
#Infrastructure

AI & ML Reporter

OpenAI details how they scaled PostgreSQL to handle millions of queries per second for ChatGPT's massive user base through connection pooling, read replicas, caching innovations, and strategic workload migration.

When ChatGPT exploded to 100 million users within weeks of launch, OpenAI's engineering team faced an unprecedented database scaling challenge. Their core transactional database – PostgreSQL – suddenly needed to handle exponential growth, with loads increasing over 10x in a year. This technical deep dive reveals how they scaled PostgreSQL to support 800 million users through rigorous optimization while maintaining five-nines availability.

The Scaling Dilemma

OpenAI's architecture relies on a single Azure PostgreSQL primary instance for writes and nearly 50 geo-distributed read replicas. While PostgreSQL handles read-heavy workloads efficiently, three critical pressure points emerged:

  1. Write amplification: PostgreSQL's MVCC implementation copies the entire row on every update, so changing a single field rewrites the whole row and leaves a dead tuple behind (see the sketch after this list)
  2. Single-writer bottleneck: All writes route through one primary instance
  3. Replication lag: Every additional replica adds WAL-shipping overhead on the primary
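
A quick way to see write amplification in action: the sketch below (assuming psycopg 3 and a scratch database named demo, neither of which comes from the article) updates one integer column repeatedly and then reads PostgreSQL's own dead-tuple counter.

```python
# Sketch: observe MVCC write amplification with psycopg 3 against a
# scratch "demo" database (both assumptions, not part of OpenAI's stack).
import psycopg

with psycopg.connect("dbname=demo", autocommit=True) as conn:
    conn.execute("DROP TABLE IF EXISTS users")
    conn.execute("CREATE TABLE users (id int PRIMARY KEY, name text, visits int)")
    conn.execute("INSERT INTO users VALUES (1, 'alice', 0)")

    # Each UPDATE writes a complete new row version, even though only one
    # integer changes; the superseded version lingers as a dead tuple.
    for _ in range(1000):
        conn.execute("UPDATE users SET visits = visits + 1 WHERE id = 1")

    # These counters are maintained asynchronously, so they may lag slightly.
    live, dead = conn.execute(
        "SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables "
        "WHERE relname = 'users'"
    ).fetchone()
    print(f"live tuples: {live}, dead tuples awaiting vacuum: {dead}")
```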

[Figure: Scaling PostgreSQL under heavy load requires preventing vicious cycles of latency and retries]

Architectural Solutions

Write Load Reduction

  • Migrated shardable write-heavy workloads to Azure Cosmos DB
  • Implemented lazy writes and write coalescing to smooth traffic spikes (a sketch follows this list)
  • Enforced strict backfill rate limits (some operations take >1 week)
  • Prohibited new tables in PostgreSQL, defaulting to sharded systems
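
The article doesn't show how OpenAI implemented lazy writes, but the idea behind write coalescing fits in a few lines. In this hypothetical sketch, apply_batch stands in for whatever function persists a batch to PostgreSQL, and the flush interval is arbitrary:

```python
# Hypothetical write coalescer: buffer the latest value per key in memory
# and flush on an interval, so a burst of N updates to one key becomes a
# single database write.
import threading
import time

class WriteCoalescer:
    def __init__(self, apply_batch, flush_interval=0.5):
        self._apply_batch = apply_batch   # persists {key: value} to the DB
        self._interval = flush_interval
        self._pending = {}                # key -> most recent value
        self._lock = threading.Lock()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def write(self, key, value):
        # A later write to the same key replaces the earlier one, which is
        # what coalesces the burst.
        with self._lock:
            self._pending[key] = value

    def _flush_loop(self):
        while True:
            time.sleep(self._interval)
            with self._lock:
                batch, self._pending = self._pending, {}
            if batch:
                self._apply_batch(batch)
```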

Connection Management

Deploying PgBouncer in transaction pooling mode reduced connection-setup latency from 50 ms to 5 ms and prevented connection storms:

[Figure: PostgreSQL proxy diagram]
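
On the application side, PgBouncer is transparent: the client simply connects to the pooler's endpoint. Here is a sketch with psycopg 3, in which the hostname, port, and credentials are hypothetical placeholders:

```python
# Sketch: talk to PostgreSQL through PgBouncer in transaction pooling mode.
# Host, port, and credentials are hypothetical placeholders.
import psycopg

conn = psycopg.connect(
    host="pgbouncer.internal",
    port=6432,                 # PgBouncer's conventional listen port
    dbname="app",
    user="app_user",
)
# Transaction pooling reuses server connections across clients, so session
# state is unreliable; disable psycopg's automatic server-side prepared
# statements, which assume a stable session.
conn.prepare_threshold = None

with conn.transaction():
    # Every statement inside one transaction runs on the same pooled
    # server connection; the pool reclaims it at COMMIT.
    conn.execute("UPDATE accounts SET balance = balance - 10 WHERE id = %s", (1,))
    conn.execute("UPDATE accounts SET balance = balance + 10 WHERE id = %s", (2,))

conn.close()
```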

Replication Scaling

With nearly 50 replicas, WAL shipping threatened to overwhelm the primary. OpenAI is testing cascading replication, in which intermediate replicas relay WAL to replicas further downstream:

[Figure: PostgreSQL cascading replication diagram]
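
Cascading replication is configured on the standbys themselves: each leaf replica's primary_conninfo points at an intermediate relay instead of the primary. A small monitoring sketch (hostnames and the monitor role are hypothetical) can confirm the fan-out, since pg_stat_replication lists downstream standbys on any node that ships WAL, relays included:

```python
# Sketch: confirm a cascading topology by listing each node's downstream
# standbys. Hostnames and the "monitor" role are hypothetical.
import psycopg

def downstream_standbys(host):
    with psycopg.connect(host=host, dbname="postgres", user="monitor") as conn:
        return conn.execute(
            "SELECT application_name, state, replay_lsn"
            " FROM pg_stat_replication"
        ).fetchall()

# The primary should now feed only a handful of intermediate relays...
print("primary ->", downstream_standbys("pg-primary.internal"))
# ...and each relay fans WAL out to its leaf read replicas.
print("relay-1 ->", downstream_standbys("pg-relay-1.internal"))
```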

Cache Protection

The team implemented cache leasing (a minimal sketch follows this list):

  • Only one request fetches data on cache miss
  • Concurrent requests wait for populated cache
  • Prevents cache-miss storms from overwhelming PostgreSQL
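
The article describes the mechanism but not the code. A minimal single-flight sketch of the idea looks like this, where fetch stands in for the PostgreSQL query; it is in-process only, and a shared lease store would be needed across servers:

```python
# Minimal in-process sketch of cache leasing (single-flight). A production
# version would hold the lease in a shared store such as Redis.
import threading

class LeasedCache:
    def __init__(self, fetch):
        self._fetch = fetch            # loads a key from PostgreSQL
        self._data = {}
        self._leases = {}              # key -> Event held by the fetching request
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._data:
                    return self._data[key]
                lease = self._leases.get(key)
                if lease is None:
                    # This request wins the lease and alone hits the database.
                    lease = self._leases[key] = threading.Event()
                    is_owner = True
                else:
                    is_owner = False
            if is_owner:
                try:
                    value = self._fetch(key)
                    with self._lock:
                        self._data[key] = value
                    return value
                finally:
                    with self._lock:
                        del self._leases[key]
                    lease.set()        # wake any waiting requests
            else:
                lease.wait()           # wait for the owner, then re-check the cache
```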

Critical Optimizations

  • Query Tuning: Eliminated 12-table joins, moving complex join logic into the application layer
  • Workload Isolation: Separated high- and low-priority traffic to prevent "noisy neighbor" effects
  • Schema Management: Restricted schema changes to lightweight operations that avoid full-table rewrites
  • Rate Limiting: Implemented multi-layer throttling at the application, proxy, and query levels (see the token-bucket sketch below)
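
Any of those throttling layers can be built from a simple primitive such as a token bucket. The sketch below is illustrative, and the rates are made up rather than OpenAI's:

```python
# Illustrative token-bucket limiter, the kind of primitive usable at the
# application, proxy, or per-query-class layer. Rates are made up.
import threading
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self._rate = rate                  # tokens added per second
        self._capacity = capacity          # maximum burst size
        self._tokens = float(capacity)
        self._stamp = time.monotonic()
        self._lock = threading.Lock()

    def allow(self):
        with self._lock:
            now = time.monotonic()
            self._tokens = min(self._capacity,
                               self._tokens + (now - self._stamp) * self._rate)
            self._stamp = now
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False                   # caller sheds or queues the query

# e.g. 500 queries/sec sustained with bursts up to 1,000
limiter = TokenBucket(rate=500, capacity=1000)
if not limiter.allow():
    pass  # reject with a retryable error instead of hitting PostgreSQL
```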

Performance Results

Despite scaling challenges, the team achieved:

  • p99 latency in double-digit milliseconds
  • 99.999% availability over 12 months
  • Only one SEV-0 PostgreSQL incident, which occurred during a viral feature launch
  • Capacity headroom for continued growth

Future Directions

  • Complete migration of non-shardable write workloads
  • Production rollout of cascading replication
  • Evaluation of sharded PostgreSQL for future needs

This scaling journey demonstrates PostgreSQL's underappreciated capacity for massive read-heavy workloads. As OpenAI engineer Bohan Zhang notes: "With rigorous optimization, we've pushed PostgreSQL to handle millions of QPS while maintaining low latency and high reliability – something many considered impractical at our scale."
