Coinbase engineers traced the May 7 trading outage to an AWS cooling failure, then found their own low-latency architecture extended the disruption.

Coinbase engineers published a postmortem on a May 7, 2026, outage that stopped trading, deposits, withdrawals and transfers across much of the exchange for several hours.
AWS operators lost cooling in a data hall in the us-east-1 region. Rack temperatures rose, AWS shut down affected hardware, and Coinbase lost Amazon EC2 instances and Amazon EBS volumes inside one availability zone.
Coinbase engineers traced the long recovery to two internal systems. The exchange matching engine lost quorum after AWS took down three of five nodes. Kafka workloads also stayed pinned to the impaired zone, building backlogs that slowed platform recovery.
Matching engine design
Coinbase engineers built the matching engine for low latency. The service runs a five-node Raft cluster inside one AWS Cluster Placement Group, which keeps nodes close enough for fast consensus traffic. The design helps high-frequency trading workloads, where extra network hops can change order execution.
The same placement choice constrained recovery. Raft commits a write only after a majority of nodes acknowledge it, so a five-node cluster needs three live members. After AWS shut down three nodes, the two survivors could not form a majority and the cluster could not process trades. Coinbase engineers had to reconstruct the cluster, restore quorum, and move trading through staged modes before they reopened normal order flow.
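The quorum arithmetic behind that stall is small enough to check by hand. The Python sketch below is illustrative only: the function names are hypothetical and it models nothing but the majority rule, not Coinbase's matching engine or any particular Raft library.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, live_nodes: int) -> bool:
    """True when enough nodes remain to commit new log entries."""
    return live_nodes >= majority(cluster_size)


# Figures from the postmortem: five nodes, three shut down.
cluster_size = 5
live_nodes = cluster_size - 3

print(majority(cluster_size))                # 3
print(has_quorum(cluster_size, live_nodes))  # False
```

A five-node cluster tolerates two failures. The third pushes it below the majority line, and that was the position the matching engine was in.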
The postmortem shows a familiar financial-systems trade-off. Exchange engineers often place consensus members close together to cut latency. That choice can bind a critical service to one physical failure domain unless engineers add cross-zone failover, rehearse quorum repair, and prove that the failover path meets business rules.
Kafka backlogs
Coinbase engineers also found problems in the event-streaming layer. Kafka partitions that carried operational data stayed in the damaged zone. As dependent services recovered, those partitions could not drain at the pace the platform needed.
Engineers migrated partitions by hand and rebalanced workloads before data flow returned. That work added risk because teams had to change data placement during an active exchange outage.
Apache Kafka gives operators strong tools for replication, partition movement and consumer scaling, but operators still need placement rules that match their failure model. A multi-availability-zone cloud region gives you options; your own partition assignment and failover automation decide whether you can use them under pressure.
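One way to check placement against a failure model is to audit replica assignments for partitions whose replicas all sit in one zone. The sketch below assumes you have already exported a broker-to-zone map and a partition-to-replica map from your own tooling. The broker IDs, zone labels, topic names and the `single_zone_partitions` helper are all hypothetical, and the code does not talk to a Kafka cluster.

```python
# Hypothetical inventory: broker ID -> availability zone.
BROKER_ZONE = {1: "az-a", 2: "az-a", 3: "az-b", 4: "az-c"}

# Hypothetical replica assignment: (topic, partition) -> broker IDs.
ASSIGNMENT = {
    ("orders", 0): [1, 2],
    ("orders", 1): [1, 3],
    ("ledger", 0): [3, 4],
}


def single_zone_partitions(assignment, broker_zone):
    """Return (topic-partition, zone) pairs whose replicas share one zone."""
    flagged = []
    for topic_partition, replicas in sorted(assignment.items()):
        zones = {broker_zone[broker] for broker in replicas}
        if len(zones) == 1:
            flagged.append((topic_partition, zones.pop()))
    return flagged


print(single_zone_partitions(ASSIGNMENT, BROKER_ZONE))
# [(('orders', 0), 'az-a')]
```

Kafka's `broker.rack` setting lets the cluster spread replicas across racks or zones when topics are created. An audit like this one is a way to confirm that the result still matches the intended layout after manual reassignments.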
Cloud design lessons
AWS designs availability zones as separate failure domains, and the company documents the model in its global infrastructure guidance. Coinbase engineers still found one-zone coupling inside a service that had strict latency goals.
That detail matters for teams running exchanges, payment systems and trading infrastructure on cloud platforms. You can deploy on Amazon EC2 and store state on Amazon EBS, yet still tie a business process to one data hall through placement groups, quorum math, broker assignment or manual recovery steps.
The matching engine used Raft for consensus. Raft gives engineers a clear leader and quorum model, but it resolves the trade-off after node loss in one direction: a cluster without a majority stops accepting writes rather than risk inconsistent state. Coinbase's design therefore favored consistency over availability, and engineers had to rebuild enough cluster membership to trade again.
Coinbase said it will add automated cross-zone recovery for the matching engine, improve quorum restoration, strengthen messaging resilience and expand disaster recovery tests. Those changes target the work that slowed recovery on May 7: manual cluster repair, stranded data streams and untested assumptions about zone loss.
For infrastructure teams, the incident gives a concrete review list. Map each quorum group to a failure domain. Check broker and partition placement. Prove that recovery scripts work without code changes. Measure the latency cost of cross-zone consensus before a failure forces that decision during market hours.
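The first item on that list can be turned into a simple check: for each quorum group, ask which single zone would take the group below a majority if it failed. The Python sketch below works under stated assumptions. The node names, zone labels and the `zones_that_break_quorum` helper are hypothetical, and the layouts are examples rather than Coinbase's actual topology.

```python
from collections import Counter


def zones_that_break_quorum(members: dict[str, str]) -> list[str]:
    """Given node -> zone, list zones whose loss leaves no majority."""
    needed = len(members) // 2 + 1
    nodes_per_zone = Counter(members.values())
    return sorted(
        zone
        for zone, count in nodes_per_zone.items()
        if len(members) - count < needed
    )


# All five members in one zone, as with a single placement group.
colocated = {f"node-{i}": "az-a" for i in range(5)}

# The same five members spread 2/2/1 across three zones.
spread = {
    "node-0": "az-a", "node-1": "az-a",
    "node-2": "az-b", "node-3": "az-b",
    "node-4": "az-c",
}

print(zones_that_break_quorum(colocated))  # ['az-a']
print(zones_that_break_quorum(spread))     # []
```

The spread layout survives any single-zone loss, but each commit then waits on cross-zone round trips. That is the latency cost the review list says to measure ahead of time.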
