Why Discord Keeps Rewriting Its Stack: A Case Study in Distributed Systems Evolution

Discord's journey through multiple stack rewrites reveals critical lessons about distributed systems scaling, performance optimization, and pragmatic engineering decisions. Their evolution from MongoDB to Cassandra to ScyllaDB, and from Elixir and Go to Rust for specific hot paths, demonstrates how data-driven thresholds, not hype, should guide architectural changes.

Discord's stack evolution reads like a masterclass in distributed systems scaling. What began as a straightforward startup technology stack has undergone multiple strategic rewrites, each driven by specific scaling challenges rather than technological fads. This journey offers valuable insights for anyone building systems that must grow from thousands to hundreds of millions of users.

The Initial Architecture: 2015

Discord's founding stack represented pragmatic choices optimized for rapid development and initial scalability needs:

Elixir: Leveraging the BEAM VM for real-time messaging with its inherent concurrency model
Python: For API services and business logic
Go: For microservices requiring performance
MongoDB: As the primary database
Electron: For cross-platform desktop client

This combination allowed Discord to ship quickly while providing adequate performance for their initial user base. The choices reflected common industry practices at the time—prioritizing development speed and concurrency over extreme optimization.

The First Crisis: MongoDB's Scaling Limits (2017)

By 2017, Discord had reached 5 million concurrent users and had run into MongoDB's scaling limits: the database could no longer keep up with the read and write load of a growing user base.

Their solution was a migration to Apache Cassandra, a distributed NoSQL database designed for horizontal scaling. They deployed 12 nodes to handle their load, which proved adequate for several years. This decision reflected a common pattern in distributed systems: when a component becomes a bottleneck, replace it with one designed for scale.

The Cassandra migration demonstrated Discord's first scaling principle: measure, identify bottlenecks, and replace components that can't meet performance requirements. This approach would guide their subsequent architectural decisions.

The Second Database Migration: Cassandra to ScyllaDB (2022)

Five years later, those 12 Cassandra nodes had grown to 177. Maintenance complexity increased, operational costs climbed, and performance began to degrade. The Cassandra architecture, while horizontally scalable, became operationally expensive at Discord's scale.

Their solution was another database migration, this time to ScyllaDB. ScyllaDB, written in C++, is compatible with Cassandra's data model but has no JVM garbage collector and uses an asynchronous, shard-per-core architecture that makes better use of each CPU.

This journey through three databases (MongoDB → Cassandra → ScyllaDB) illustrates a critical distributed systems principle: no database remains optimal forever. As data volumes, access patterns, and scale requirements change, so must storage technologies.

The Language Shift: Elixir to Rust (2019)

While database migrations addressed storage bottlenecks, Discord also faced performance challenges in their application code. Specifically, sorting large data sets in Elixir was taking 170ms per operation—an unacceptable latency at Discord's scale.

The solution was a targeted rewrite of these components in Rust. The result was dramatic: latency dropped from 170ms to just 1ms. This 170x performance improvement demonstrates how the right language choice can solve specific performance problems.
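To put the two latency figures in perspective, here is a small worked calculation. It assumes a single worker that handles operations strictly one at a time, which is a simplification introduced for illustration and not a description of Discord's actual service.

```python
def ops_per_second(latency_ms: float) -> float:
    """Upper bound on operations per second for one worker that
    processes requests strictly one at a time (an assumption)."""
    return 1000.0 / latency_ms


before_ms = 170.0  # Elixir figure quoted above
after_ms = 1.0     # Rust figure quoted above

speedup = before_ms / after_ms
print(f"speedup: {speedup:.0f}x")
print(f"before: {ops_per_second(before_ms):.2f} ops/s per worker")
print(f"after:  {ops_per_second(after_ms):.2f} ops/s per worker")
```

Under that assumption, a worker that could finish fewer than six operations per second before the rewrite can finish about a thousand after it.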

This migration highlights an important nuance in Discord's approach: they didn't abandon Elixir entirely. The language still handles real-time messaging where its concurrency model excels. Instead, they surgically replaced components where Elixir's characteristics became limitations.

The Garbage Collector Problem: Go to Rust (2020)

Discord's Read States service, which tracks every message a user has read, presented another challenge. This service gets hit on nearly every user action—opening the app, sending a message, reading a message. The Go implementation suffered from garbage collection pauses.

Go's runtime forces a garbage collection at least once every 2 minutes, even when little new memory has been allocated, and each cycle has to walk the live heap to work out what is still in use. With millions of user states cached, that walk caused significant latency spikes. Discord tried tuning the garbage collector and upgrading Go versions, but couldn't eliminate the problem.
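The scanning cost can be illustrated with a toy model of a tracing collector's mark phase. This is only a sketch: Go's real collector is concurrent and far more sophisticated, and the `Obj`, `mark`, and `build_cache` names are hypothetical. The point it shows is that the work per cycle grows with the number of live objects, so a large cache is expensive to scan even when almost nothing has become garbage.

```python
class Obj:
    """A heap object holding references to other objects."""

    def __init__(self, refs=None):
        self.refs = list(refs) if refs is not None else []


def mark(roots):
    """Toy mark phase: visit every object reachable from the roots.

    Returns the number of objects visited, a stand-in for the work
    a tracing collector must do on each cycle.
    """
    visited = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if id(obj) in visited:
            continue
        visited.add(id(obj))
        stack.extend(obj.refs)
    return len(visited)


def build_cache(n_entries):
    """A cache object that keeps n_entries read-state objects alive."""
    return Obj(refs=[Obj() for _ in range(n_entries)])


small = mark([build_cache(1_000)])
large = mark([build_cache(100_000)])
print(small, large)  # marking work tracks the live cache size
```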

Their solution was another Rust rewrite. Rust's ownership model removes the need for a garbage collector: a value's memory is freed as soon as its owner goes out of scope, so there is no periodic scan of the heap. This change eliminated the scanning pauses, reducing latency from milliseconds to microseconds.
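Python cannot express Rust's ownership rules, so the sketch below only marks the point at which a Rust implementation would release memory. It assumes the cached states live in a least-recently-used (LRU) cache; the text above says only that millions of states are cached. The `LRUCache` class and its `on_evict` hook are hypothetical names. In an ownership-based design, the eviction step is where the evicted value would be dropped and its memory returned immediately, with no later scan needed to discover it.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache with a fixed capacity."""

    def __init__(self, capacity, on_evict=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.on_evict = on_evict
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Remove the least recently used entry. In Rust, the evicted
            # value would be dropped (freed) at exactly this point.
            old_key, old_value = self._items.popitem(last=False)
            if self.on_evict is not None:
                self.on_evict(old_key, old_value)


evicted = []
cache = LRUCache(capacity=2, on_evict=lambda k, v: evicted.append(k))
cache.put("user:1", {"last_read": 10})
cache.put("user:2", {"last_read": 20})
cache.get("user:1")                     # user:1 becomes most recently used
cache.put("user:3", {"last_read": 30})  # evicts user:2
print(evicted)
```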

This migration coincided with the pandemic-driven user surge to 100 million monthly active users. The timing wasn't lucky—it was necessary. The Rust rewrite prevented what would have been a critical scaling bottleneck.

What They Didn't Change

Notably, Discord maintained several components throughout these rewrites:

Elixir for real-time messaging: The BEAM VM's concurrency model remains optimal for their real-time chat needs
Python for APIs: Sufficient for their API gateway and business logic
Electron for desktop: Despite its performance reputation, it met their cross-platform needs
React Native for mobile: Eventually unified iOS and Android after separate native implementations

This consistency demonstrates another principle: replace only what needs replacing. The BEAM VM's lightweight processes, which let a single node run millions of concurrent processes, and its supervisors, which restart failed processes quickly, made it ideal for Discord's core messaging functionality. Premature optimization or unnecessary rewrites would have added complexity without benefit.

The Pattern Behind the Changes

Every switch in Discord's stack followed a consistent pattern:

Identify a specific bottleneck through measurement
Determine that tuning couldn't solve the problem
Choose a technology that addresses the specific limitation
Implement the change with minimal disruption to other components

No switch was made for hype. No switch was made early. Each change responded to concrete scaling challenges that had emerged through operation.
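The four steps can be read as a small decision procedure. The sketch below is one possible encoding, not Discord's actual process: the `Component` fields, the service names, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    measured_p99_ms: float   # step 1: what measurement shows
    target_p99_ms: float     # the requirement the component must meet
    tuning_exhausted: bool   # step 2: have tuning attempts failed?


def next_action(c: Component) -> str:
    """Return 'keep', 'tune', or 'replace' for a component."""
    if c.measured_p99_ms <= c.target_p99_ms:
        return "keep"      # not a bottleneck, so no early switch
    if not c.tuning_exhausted:
        return "tune"      # try the cheaper fix first
    return "replace"       # steps 3 and 4: targeted replacement


# Hypothetical services and numbers, for illustration only.
components = [
    Component("service-a", 8.0, 20.0, False),
    Component("service-b", 45.0, 20.0, True),
    Component("service-c", 30.0, 20.0, False),
]
for c in components:
    print(c.name, "->", next_action(c))
```

Only the component that both misses its target and has resisted tuning is marked for replacement; the others are kept or tuned.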

Broader Implications for Distributed Systems

Discord's journey offers several lessons for distributed system architects:

Threshold-Based Architecture

Discord's approach embodies threshold-based architecture—components are replaced when specific metrics cross predefined thresholds. This contrasts with architectural approaches that attempt to design for maximum scale from day one, which often leads to over-engineering and unnecessary complexity.
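A threshold check of this kind might look like the following sketch. The choice of the 99th percentile, the nearest-rank method, the sample data, and the 25 ms threshold are all assumptions made for illustration; the article does not say which metrics or thresholds Discord uses.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


def crossed(samples, pct, threshold_ms):
    """True when the chosen percentile exceeds the predefined threshold."""
    return percentile(samples, pct) > threshold_ms


# Hypothetical latency samples in milliseconds: mostly fast, a few spikes.
latencies_ms = [2.0] * 97 + [40.0, 55.0, 60.0]

print(percentile(latencies_ms, 50))
print(percentile(latencies_ms, 99))
print(crossed(latencies_ms, 99, threshold_ms=25.0))
```

In this made-up data the median looks healthy while the tail crosses the threshold, which is why a threshold on a high percentile can flag a bottleneck that an average would hide.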

Surgical Replacements

Discord didn't wholesale replace their stack. They made surgical replacements of specific components. This approach minimizes risk while addressing bottlenecks. It requires careful component isolation but offers significant operational benefits.

Language Specialization

Their migration from Elixir to Rust for specific operations demonstrates the value of language specialization. Different languages excel in different domains, and the optimal architecture often leverages multiple languages for their respective strengths.

Operational Cost Awareness

The Cassandra to ScyllaDB migration highlights the importance of considering operational costs, not just raw performance. As systems scale, operational complexity can become as significant a bottleneck as raw performance.

The Future of Discord's Stack

Discord's evolution continues. As they approach 500 million users, new bottlenecks will inevitably emerge. Their demonstrated approach—measure, identify thresholds, replace components surgically—will likely guide their future architectural decisions.

The pattern suggests they'll continue leveraging multiple languages and databases, each selected for specific operational characteristics rather than ideological purity. Their architecture will likely become increasingly heterogeneous, with each component optimized for its specific role.

Lessons for Other Systems

Discord's journey offers several actionable lessons:

Measure everything: You can't optimize what you don't measure
Replace components, not systems: Surgical changes minimize risk while solving problems
Wait for thresholds: Don't replace components before they become bottlenecks
Consider operational costs: Performance isn't the only factor in scaling
Leverage language specialization: Use the right tool for each job

Discord's stack evolution isn't about finding the "perfect" technology stack. It's about creating a process for continuous adaptation as scaling challenges emerge. Their approach demonstrates that successful distributed systems aren't designed upfront—they evolve through operational experience and data-driven decisions.

For more details on Discord's architecture, see their engineering blog posts and the presentations from their tech talks.


For additional context on scaling patterns, you might find Martin Kleppmann's Designing Data-Intensive Applications particularly relevant, as it covers many of the principles Discord's evolution exemplifies.
