An in-depth analysis of the engineering principles behind building financial exchanges with extreme reliability and performance, focusing on single-threaded architectures, consensus algorithms, and deterministic systems.
Building High-Performance Financial Exchanges: Determinism, Consensus, and Sub-Millisecond Latency

Financial exchanges represent some of the most demanding distributed systems in existence. They require perfect correctness, absolute fairness, 24/7 availability, and sub-millisecond response times—all while handling massive transaction volumes. Frank Yu, Director of Engineering at Coinbase, shared the engineering philosophy behind building such systems in his recent presentation at QCon San Francisco.
The Challenge: Financial Infrastructure with Extreme Requirements
Exchanges serve as financial infrastructure where participants submit orders to buy and sell assets, with the exchange providing up-to-date prices and executing trades. As Yu explains, "Everyone who participates in finance and markets, they come to us to get the up-to-date price of things and also to print trades, assign prices when folks want to trade something."
The reliability requirements are staggering. "My mind, it goes really fast, and if it goes wrong then by the time PagerDuty rings it's over," Yu shares. "The potential losses we risk are multiple orders of magnitude more than any revenue we might get from any transaction."
Core Requirements for Exchanges
- Correctness: The system must maintain accurate state at all times
- Fairness: All participants must have an equal opportunity to execute
- Availability: 24/7 operation is non-negotiable
- Consistency: Predictable performance (flat P99s, P50s, and averages)
- Auditability: Ability to reconstruct market state at any microsecond in the past
These requirements stem from exchanges being trusted third parties in financial markets. "We focus so much on reliability that people in financial markets effectively outsource their reliability to the exchange because they figure, if the exchange goes down, we're toast anyways," Yu notes.
Functional Requirements: Simplifying Complexity
When building an exchange, Yu approaches the problem by first defining functional requirements through a test harness. "The last four or five times I've built an exchange, what I do is open up the IDE, build up a black box harness of user API actions or user stories to specify the behavior that I want."
A key simplification is the "price, time, priority" model that condenses pages of functional specifications into three concepts:
- Price: The value at which a participant is willing to buy or sell
- Time: When the order was submitted
- Priority: The order's position in the queue for execution
This model handles complex order matching logic. For example, when multiple orders overlap in price, the system determines execution order based on time and priority. When Bob submits a sell order at 99 that overlaps with existing buy orders at 100, the system executes trades starting with the highest priority buy order, ensuring price improvement for Bob while maintaining fairness.
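Bob's example can be sketched as a toy price-time-priority match. The heap layout and the `match_sell` helper below are illustrative only, not the exchange's actual data structures; resting bids are ordered best price first, ties broken by arrival time:

```python
import heapq

def match_sell(book_bids, sell_price, sell_qty):
    """Match an incoming sell against resting bids using price-time
    priority. book_bids is a heap of (-price, time, qty) tuples:
    best price first, ties broken by earliest arrival."""
    trades = []
    while sell_qty > 0 and book_bids and -book_bids[0][0] >= sell_price:
        neg_price, t, qty = heapq.heappop(book_bids)
        fill = min(qty, sell_qty)
        trades.append((-neg_price, fill))  # trade prints at the resting bid's price
        sell_qty -= fill
        if qty > fill:                     # partial fill keeps its time priority
            heapq.heappush(book_bids, (neg_price, t, qty - fill))
    return trades

# Bob sells 10 at 99 into two resting bids at 100 (6 each, different times):
bids = []
heapq.heappush(bids, (-100, 1, 6))  # 6 @ 100, arrived first
heapq.heappush(bids, (-100, 2, 6))  # 6 @ 100, arrived second
print(match_sell(bids, 99, 10))     # → [(100, 6), (100, 4)]
```

Note that both fills print at 100, not 99: the earlier bid fills completely first, and Bob gets price improvement, exactly the fairness property described above.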
Architecture Choice: Single-Threaded Determinism
The most controversial yet critical design choice is the single-threaded architecture. "If I want to handle a lot of different orders, why don't I assign different CPUs to handle orders? Then each of them can go find other matches, and this and that. That way maybe I can take advantage of parallelism and all that. Could work, for sure," Yu explains before presenting the counter-argument.
The problem with parallelism is determinism: "If you let things be concurrent, that's where you lose determinism. What we talked about, 1, 2, 3, 4, 5, all those actions really should result in the same output every single time."
Benefits of Single-Threaded Architecture
- Determinism: Same inputs always produce same outputs
- Simplified Scaling: Performance directly tied to single-thread speed
- Easier Optimization: Clear hot path to optimize
- Simplified Debugging: Entire system state fits in memory on one machine
- Perfect Reproduction: Production logs can be replayed exactly for debugging
"If I just handle all those orders in order, like I was clicking through these buttons, I just put it all in one thread. One thread, what does that mean? That means I have one program running on one CPU core, not reordering anything, just handling orders and doing stuff," Yu explains.
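The single-threaded model Yu describes can be sketched as one loop applying inputs to in-memory state, strictly in arrival order. This is a minimal illustration (the event shape and `MatchingEngine` name are invented here), but it shows the property that matters: the same input sequence always produces the same state:

```python
class MatchingEngine:
    """One logical thread applies inputs in arrival order to
    in-memory state. With no concurrency, the same input
    sequence always yields the same state -- which is what
    makes exact replay possible."""
    def __init__(self):
        self.seq = 0
        self.balances = {}

    def apply(self, event):
        # Every input is handled to completion before the next one.
        self.seq += 1
        kind, account, amount = event
        delta = amount if kind == "credit" else -amount
        self.balances[account] = self.balances.get(account, 0) + delta
        return self.seq

def run(inputs):
    engine = MatchingEngine()
    for event in inputs:   # strictly in order: no reordering, no locks
        engine.apply(event)
    return engine.balances

log = [("credit", "alice", 5), ("debit", "alice", 2), ("credit", "bob", 7)]
assert run(log) == run(log)  # determinism: replay reproduces identical state
print(run(log))              # → {'alice': 3, 'bob': 7}
```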
Consensus and Durability: Raft in the Hot Path
While the matching engine runs single-threaded, durability is achieved through the Raft consensus algorithm. "Normally what happens is your web server receives an API call, it has to do the work, and then it sends its state over Postgres' network protocol to Postgres. Postgres has to do some computation, compute the result, send out some write-ahead log stuff, but also send out the acknowledgment to you. All of that is happening blocking."
With Raft, the process changes dramatically:
- Order comes into the system
- Before processing, ensure quorum (e.g., 3 out of 5 machines) has received the request
- Process the order deterministically
- Replicate the result to followers asynchronously
"If you've got a really fast consensus implementation, then you can basically handle orders as they come in, only deal with persisting those, and then your business logic can basically be processing concurrently with the durability process," Yu explains.
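The replicate-before-process flow might be sketched like this. The `ReplicatedLog` and `Node` classes are toy stand-ins for a real Raft implementation (which adds leader election, terms, and log repair); the point is only the ordering: the input becomes durable on a quorum before the business logic runs:

```python
def quorum(n):
    return n // 2 + 1

class Node:
    def __init__(self):
        self.log = []
    def store(self, request):
        self.log.append(request)
        return 1  # acknowledge

class ReplicatedLog:
    """Toy stand-in for a Raft log: an entry counts as durable
    once a majority of nodes has acknowledged it."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.entries = []
    def append(self, request):
        acks = sum(node.store(request) for node in self.nodes)
        if acks < quorum(len(self.nodes)):
            raise RuntimeError("no quorum; request not durable")
        self.entries.append(request)

def handle(log, state, request):
    # 1. Make the *input* durable first (quorum ack).
    log.append(request)
    # 2. Only then run the deterministic business logic; followers
    #    can regenerate the same output from the replicated input.
    state.append(f"processed:{request}")

nodes = [Node() for _ in range(5)]
rlog, state = ReplicatedLog(nodes), []
handle(rlog, state, "order-1")
print(state)   # → ['processed:order-1']
```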
Why Five Nodes, Not Three?
"We've definitely been very happy that we ran with five, and not three. Because if you ran with three, you can only suffer the failure of one machine before you lose that replication. If you run with five, two machines have to die for you to be scared. You can suffer the loss of two, and then three machines have to die for you to be broken."
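The arithmetic behind this is simple majority-quorum math: a cluster of n nodes needs ⌊n/2⌋ + 1 acknowledgments, so it tolerates n minus that many failures:

```python
def quorum(n):
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

def failures_tolerated(n):
    """Nodes that can die while the cluster still reaches quorum."""
    return n - quorum(n)

for n in (3, 5):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {failures_tolerated(n)} failures")
# 3 nodes: quorum 2, tolerates 1 failure
# 5 nodes: quorum 3, tolerates 2 failures
```

With five nodes, two can die and the cluster keeps committing; only a third failure breaks quorum, matching Yu's "three machines have to die for you to be broken."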
System Layout: Components and Interactions
The Coinbase International Exchange architecture consists of several key components:
- API Gateways: Stateless components that handle incoming requests
- Matching Engine Cluster: Raft cluster of 5 nodes running the core matching logic
- Replicas: Strongly consistent replicas that can serve queries without perturbing the writer
- Disaster Recovery Site: Can be promoted to primary when needed

Message Flow and Optimization
Yu highlights an important optimization enabled by determinism:
"What if, instead of doing that, my system is deterministic, why don't I just replicate the request, have the request percolate to my downstream systems that care about the output, and then I just rerun my matching logic? Now I can send that mouthful of stuff over IPC, no network cost."
This approach significantly reduces network egress costs. Instead of replicating the output events (which can be large), the system replicates the small input requests and regenerates outputs locally.
"It also allows us to have what's being transported as our request stream and not the output. I'll give you another example here. What if that was a SQL and the red stuff we were sending was the write-ahead log? What if someone submitted a request to update blah, blah, blah where ID equals anything? One request comes in and may create thousands or tens of thousands of write-ahead log entries. If your system is deterministic, I'll just replicate that really annoying query instead of suddenly having a spike in my change data capture."
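The asymmetry Yu describes, with a small input fanning out into a large output, can be shown with a deterministic toy `process` function (the function and its fan-out are invented for illustration): replicating the input and rerunning the logic on each replica yields the identical output without shipping it over the network:

```python
def process(request):
    """Deterministic logic: one small request fans out into many
    output events (like one UPDATE producing thousands of
    write-ahead-log entries)."""
    n = request["rows"]
    return [f"row-{i}-updated" for i in range(n)]

request = {"rows": 10_000}

# Output replication would ship every event over the network.
output = process(request)

# Input replication ships only the tiny request; each replica
# reruns the same deterministic logic locally.
replica_output = process(request)

assert replica_output == output  # determinism guarantees identical results
print(len(str(request)), "bytes of input vs", len(str(output)), "bytes of output")
```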
Performance Optimization: Making the Hot Path Fast
The single-threaded architecture means that scalability is directly tied to the performance of the hot path. "The faster your single thread runs, the more orders you can handle," Yu emphasizes.
Key Optimization Techniques
- Avoid Blocking the Hot Path: Minimize any reason for the processor to wait
- Co-locate Components: Place all hot path components as close as possible
- Simple Binary Encoding: Use flat binary formats instead of nested protocols
- Pre-allocation: Avoid dynamic memory allocation in the hot path
- CPU Pinning: Prevent the OS from moving the critical thread
- Avoid Unbounded Operations: Use indexed access instead of full scans
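The last point, indexed access instead of full scans, is easy to illustrate with a hypothetical order-cancel path: the index is built off the hot path, so the hot path never pays for the size of the book:

```python
# Build the index once, off the hot path.
orders_list = [{"id": i, "qty": i} for i in range(100_000)]
orders_by_id = {o["id"]: o for o in orders_list}

def cancel_scan(order_id):
    # Unbounded: worst case walks the whole book.
    for o in orders_list:
        if o["id"] == order_id:
            return o

def cancel_indexed(order_id):
    # O(1): cost is independent of how many orders are resting.
    return orders_by_id[order_id]

assert cancel_scan(99_999) is cancel_indexed(99_999)
```

The scan's latency grows with the book and creates exactly the kind of P99 outliers the "flat percentiles" requirement forbids; the indexed lookup stays constant.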
Data Representation
"Our internal representation of messages, we use simple binary encoding. It's basically just a bunch of bytes next to each other. The CPU knows that 6 bytes after the front of the message, that's the message type. It's super quick to decide if I even care about this message. The next 8 bytes, that's the ID of the thing I want to trade. The next 8 bytes are the price. The next 8 bytes are the quantity."
This approach contrasts with more complex protocols like Protocol Buffers: "Don't even bother with UUIDs, just use Snowflake IDs. You can fit a bunch of globally unique IDs that are sortable by timestamp in 64 bits. This is much better than having deeply nested protobufs, where to even figure out if you should handle that message, you've got to remarshal it, build up this whole palace of Legos that represent your highly nested message before you even can tell what the thing is."
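A flat layout along the lines Yu describes can be sketched with fixed byte offsets. The exact layout below is hypothetical (following his description of a type field 6 bytes in, then three 8-byte fields); the key property is that a reader can peek one field without decoding the whole message:

```python
import struct

# Hypothetical flat layout:
#   bytes 0-5   : header (ignored here)
#   bytes 6-7   : message type (uint16)
#   bytes 8-15  : instrument ID (uint64)
#   bytes 16-23 : price (uint64)
#   bytes 24-31 : quantity (uint64)
MSG_NEW_ORDER = 1

def encode(msg_type, instrument, price, qty):
    return struct.pack("<6xH3Q", msg_type, instrument, price, qty)

def msg_type(buf):
    # Peek a single field at its fixed offset -- no full decode,
    # no object graph to build.
    return struct.unpack_from("<H", buf, 6)[0]

def decode(buf):
    return struct.unpack_from("<3Q", buf, 8)

msg = encode(MSG_NEW_ORDER, 42, 100_00, 7)
assert msg_type(msg) == MSG_NEW_ORDER
print(decode(msg))   # → (42, 10000, 7)
```

Deciding whether to handle the message costs one two-byte read at a known offset, versus fully deserializing a nested protobuf first.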
Memory Management
"The garbage collectors on JVMs pause for milliseconds at a time. How are you building these super responsive systems on the JVM? Your garbage collector won't run if you never call new. What we're saying is, provision all of your stuff ahead of time. You should not be allocating new memory on an order by order basis."
Yu recommends using off-heap data structures: "We do things like basically put SBE structs in an off-heap map, so that we don't need to go back and forth and deal with spooky action from a distance from the garbage collector."
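The "never call new on the hot path" pattern is an object pool: allocate every slot up front and only recycle thereafter. Coinbase does this on the JVM with SBE structs in off-heap maps; the Python sketch below (with an invented `OrderPool`) just illustrates the recycling discipline:

```python
class OrderPool:
    """Pre-allocate every order slot up front; the hot path only
    recycles existing objects and never allocates new ones --
    the moral equivalent of 'never call new' on the JVM."""
    def __init__(self, capacity):
        self._slots = [{"id": 0, "price": 0, "qty": 0} for _ in range(capacity)]
        self._free = list(range(capacity))

    def acquire(self, order_id, price, qty):
        idx = self._free.pop()     # O(1), no allocation on the hot path
        slot = self._slots[idx]
        slot["id"], slot["price"], slot["qty"] = order_id, price, qty
        return idx

    def release(self, idx):
        self._free.append(idx)     # recycle the slot for reuse

pool = OrderPool(4)
i = pool.acquire(1, 100, 5)
print(pool._slots[i])   # → {'id': 1, 'price': 100, 'qty': 5}
pool.release(i)
```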
Deployment and Operations: Zero-Downtime Rolling Updates
The deterministic architecture enables zero-downtime deployments through rolling updates. "If you're running in a cluster like this, you can do rolling deployments. You can deploy your code without downtime that impacts your persistence SLA."
The Process
- Deploy new code to a follower node
- Once verified, promote it to leader
- Repeat for other nodes
- Finally, update the original leader
"This effectively unifies your resiliency with your deployment. It makes the unexpected normal. You can think of this as like, ok, let's shut this down, start a new version up. There's a new leader. Then I can shut another follower down, and then I'll restart the leader last. That's how you can do blue-green deployments and have effectively no downtime on your exchange."
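The four steps above can be sketched as a plan generator (node names and the `rolling_deploy` helper are illustrative): restart followers one at a time, promote an upgraded follower, and restart the old leader last, so at most one node is ever down and a five-node quorum of three is never at risk:

```python
def rolling_deploy(nodes, leader, new_version):
    """Emit the rolling-update steps: upgrade followers one at a
    time, hand leadership to an upgraded node, then upgrade the
    old leader last."""
    steps = []
    for name in nodes:
        if name != leader:
            steps.append(f"restart {name} on {new_version}")
    new_leader = next(n for n in nodes if n != leader)
    steps.append(f"promote {new_leader} to leader")
    steps.append(f"restart {leader} on {new_version}")
    return steps

for step in rolling_deploy(["n1", "n2", "n3", "n4", "n5"], "n1", "v2"):
    print(step)
```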
Why This Matters
"We have to get to 24/7. It's just so funny, not being 24/7 seems like such an anachronism in backend computing. If you're not dealing with hot trading, presumably your systems run 24/7. Any sort of website, plenty of things with much less engineering investment run 24/7."
Long maintenance windows create financial discontinuities: "When you have these long downtimes, you create weird financial discontinuities. Users who are trying to use the exchange aren't able to submit orders and do stuff, and then they have to rely on some weird side channels or whatever to manage their own financial risk. It adds complication and adds financial risk for everyone participating on the market."
Production Results and Benefits
With this architecture, Coinbase has achieved impressive results:
- Traffic spikes of six-figure transactions per second handled with no issues
- P99 response times under one millisecond
- Zero downtime deployments
- Perfect reproduction of production issues for debugging
Debugging Superpower
Yu highlights a unique benefit of the deterministic architecture: "My favorite is, if something's weird, I can do full reproduction of the functional issues because everything fits on one thread. That means the entire memory set fits on one machine. It means I can run it on my laptop. If I go ahead and find something weird in the market, I can go download the logs, the request log, replay it, and replay all of that production logic in a debugger. I will literally see what each register is and all of that."
Testing and Experimentation
"The other really cool thing, I'm going to belabor this because I really like the log replay stuff, is you can stream your request log to another stack and perturb that thing. You can run experiments on your production data and do pre-production testing of your rolling deployments, pre-production testing of your configuration changes, all because you put all your logic on one nice little request stream."
This enables safe experimentation with complex topology changes: "This gives us the ability to do complicated topology changes in the cloud, which is terrifying to anybody, but we're able to literally run that with live production streaming load. That has been a superpower for a lot of us."
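Because the request log is the single source of truth, a shadow stack can replay it under a candidate change and be diffed against the baseline before any rollout. The toy `run_engine` below (with an invented fee parameter as the "perturbation") sketches the idea:

```python
def run_engine(requests, fee_bps=0):
    """Deterministic toy engine: total notional traded, with an
    optional fee -- the 'perturbation' under test."""
    total = 0
    for price, qty in requests:
        total += price * qty
    return total + total * fee_bps // 10_000

# The production request log is the single source of truth...
request_log = [(100, 5), (101, 3), (99, 7)]
baseline = run_engine(request_log)

# ...so a shadow stack can replay the identical stream under a
# candidate configuration and compare outcomes offline.
candidate = run_engine(request_log, fee_bps=10)
print(baseline, candidate)
```

Determinism is what makes the comparison meaningful: any divergence between the two runs is attributable to the perturbation, not to timing or ordering noise.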
Trade-offs and Considerations
While this architecture delivers exceptional performance and reliability, it comes with important trade-offs:
Resource Utilization
"The one that's hot is pegged, yes," Yu confirms when asked about CPU utilization. "The only thing that needs to spin is the core logic that is handling orders as fast as possible. Everything else is just doing what normal CPUs do, which is sit idle."
This approach prioritizes predictable latency over efficient resource utilization, which is appropriate for financial systems where latency outliers can have significant financial consequences.
Vertical Scaling
Yu advocates for vertical scaling over horizontal scaling for the core matching engine. Asked whether to "go big on number of CPUs and memory" and "go beefy on the machines to reduce the number of machines you need to run," he answered: "Yes. You should vertically scale your core logic boxes."
This contrasts with the horizontal scaling favored by many large internet companies: "Which is a bit of an antipattern from how Google did things. Also, CPUs are super cheap now, and so is memory."
The reasoning is simple: "What you can do is you care about clock speed, but you don't care about a lot of cores. You care about clock speed, so you can just have a small number of really hot machines, more or less."
System Complexity
While the core matching engine is intentionally simple, the overall system includes multiple components that add complexity:
- Raft consensus implementation
- API gateways with rate limiting
- Multiple replicas for different purposes
- Cross-region replication for disaster recovery
The key is to keep the hot path simple while allowing flexibility in supporting components.
Conclusion: The Engineering Ideal of Simplicity
"Everything you get here is chasing the engineering ideal of simplicity. Because if you're simple, it's really easy to make that simple thing stable because you can just do a lot of different things with it. You don't have to worry about all these different variables," Yu concludes.
The presentation demonstrates how focusing on determinism through single-threaded architecture, combined with consensus for durability, creates a foundation for building extremely reliable and performant financial systems. By simplifying the core logic and carefully managing the hot path, the system achieves sub-millisecond latencies while maintaining 24/7 availability.
"The thing you get is if you want to just remove a lot of the stuff in your architecture diagram, there's probably easy 10X opportunities in your system right now. That's really it. Simplify your systems, everybody."