Tracing Discord's Elixir Systems (Without Melting Everything)

Discord built custom distributed tracing for its Elixir-based guild system to debug performance issues without downtime, overcoming the language's lack of built-in metadata propagation.

At Discord, chatting with friends, reacting to messages, or sharing memes needs to feel instantaneous. The platform achieves this at scale by leveraging Elixir's concurrency mechanisms to run each Discord server (a "guild," internally) independently of every other. This architectural choice allows for impressive performance, but when a guild can't keep up with user activity, the consequences ripple through the user experience.

When a guild becomes overwhelmed, users experience lag or potentially complete outages. If the system degrades beyond its ability to self-heal, on-call engineers must intervene. Their investigation typically begins with metrics and logs—Discord has extensive instrumentation measuring how frequently each user action type is processed and how long processing takes. These metrics often reveal bursty activity patterns, like a flurry of reactions to a newly released game, but they fall short of showing the complete picture of what users actually experienced. As the engineering team puts it, it's like looking at your car's dashboard: you can see the engine temperature, but not the consequences of it running hot.

When metrics don't yield answers, engineers turn to Discord's custom-built "guild timings" tool. Every time a guild processes an action, it records in an in-memory store how much of the current minute has been spent on each action type. This provides much more detailed data than standard metrics, but the volume is too high to store in full. The data rotates frequently for all but the largest guilds, and even when retrieved in time, it doesn't capture downstream effects or the end-to-end user experience.
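
The idea behind a tool like this can be modeled in a few lines. The sketch below is written in Python for illustration only; it is not Discord's Elixir code, and the class name `GuildTimings`, the `max_minutes` retention window, and the action names are all assumptions. It accumulates processing time per action type in one bucket per minute and drops the oldest buckets as new ones arrive, which is one simple way to get the "rotates frequently" behavior described above.

```python
import time
from collections import OrderedDict, defaultdict


class GuildTimings:
    """Hypothetical in-memory store: seconds spent per action type, per minute."""

    def __init__(self, max_minutes=5, clock=time.time):
        self.max_minutes = max_minutes  # assumed retention window, in minutes
        self.clock = clock              # injectable so the store can be tested
        self.buckets = OrderedDict()    # minute index -> {action_type: seconds}

    def record(self, action_type, seconds):
        """Add `seconds` of processing time to the current minute's bucket."""
        minute = int(self.clock() // 60)
        if minute not in self.buckets:
            self.buckets[minute] = defaultdict(float)
            # Rotate: drop the oldest minutes once the window is exceeded.
            while len(self.buckets) > self.max_minutes:
                self.buckets.popitem(last=False)
        self.buckets[minute][action_type] += seconds

    def share_of_minute(self, minute):
        """Fraction of the given minute spent on each action type."""
        return {a: s / 60.0 for a, s in self.buckets.get(minute, {}).items()}
```

A small retention window keeps memory bounded, at the cost the article describes: if nobody reads a bucket before it rotates out, the detail is gone.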

Other teams at Discord have found enormous value in distributed tracing (a core capability of Application Performance Monitoring, or APM, tools), which reveals how long each constituent part of an operation took. However, adding tracing to the Elixir stack presented unique challenges. Most tracing tools work by passing operation information through a metadata layer such as HTTP headers, but Elixir's built-in communication tools have no equivalent layer out of the box. This meant Discord had to build its own solution from scratch.
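
To make the missing "metadata layer" concrete, here is a minimal sketch of one way to carry trace context alongside a message when the transport has no headers: wrap the payload in an envelope. It is a Python model, not Elixir and not Discord's implementation; `TraceContext`, `Envelope`, `send`, and `receive` are hypothetical names, and a plain list stands in for a process mailbox. The ID sizes follow the W3C Trace Context convention (16-byte trace IDs, 8-byte span IDs).

```python
import secrets
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class TraceContext:
    trace_id: str                        # shared by every span in one operation
    span_id: str                         # identifies this unit of work
    parent_span_id: Optional[str] = None


@dataclass(frozen=True)
class Envelope:
    """Hypothetical wrapper: the payload plus what a header would carry."""
    payload: Any
    trace: Optional[TraceContext] = None


def new_trace() -> TraceContext:
    """Start a new trace at the edge of the system."""
    return TraceContext(secrets.token_hex(16), secrets.token_hex(8))


def child_of(parent: TraceContext) -> TraceContext:
    """Start a span in the receiver, linked to the sender's span."""
    return TraceContext(parent.trace_id, secrets.token_hex(8), parent.span_id)


def send(mailbox: list, payload: Any, trace: Optional[TraceContext]) -> None:
    """Sender side: attach the current trace context to the message."""
    mailbox.append(Envelope(payload, trace))


def receive(mailbox: list):
    """Receiver side: unwrap the payload and continue the trace, if any."""
    env = mailbox.pop(0)
    span = child_of(env.trace) if env.trace else None
    return env.payload, span
```

Because the receiver's span keeps the sender's `trace_id` and records the sender's `span_id` as its parent, a tracing backend can stitch the two halves of the operation together.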

The engineering team faced a significant challenge: they needed to change how their services communicate with one another without causing downtime. This is particularly tricky when dealing with a production system serving millions of users. The solution required careful planning and execution to ensure that the transition to distributed tracing wouldn't disrupt the very service they were trying to improve.
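
The article does not say how Discord sequenced the change, so the following is only a sketch of a common pattern for changing a message format without downtime, again modeled in Python rather than Elixir. The assumption is a two-phase rollout: first deploy receivers that understand both the legacy and the traced format, then turn on the new format in senders. The `"traced"` tag, the `send_envelope` flag, and both function names are hypothetical.

```python
def encode(payload, trace_id=None, send_envelope=False):
    """Sender side. `send_envelope` stands in for a flag flipped in phase 2."""
    if send_envelope:
        # New wire format: a tagged tuple carrying metadata and the payload.
        return ("traced", {"trace_id": trace_id}, payload)
    return payload  # legacy wire format, unchanged


def decode(message):
    """Receiver side, deployed in phase 1: accepts both formats."""
    if isinstance(message, tuple) and len(message) == 3 and message[0] == "traced":
        _, meta, payload = message
        return payload, meta.get("trace_id")
    # Anything else is treated as a legacy message with no trace context.
    return message, None
```

Since every receiver can read both formats before any sender emits the new one, old and new nodes can coexist during a rolling deploy. The sketch assumes no legacy payload is itself a three-element tuple tagged `"traced"`.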

Discord's approach to this problem demonstrates the kind of engineering ingenuity required to maintain complex systems at scale. By building custom tracing infrastructure that works with Elixir's unique characteristics, the team has created a tool that will help on-call engineers diagnose and resolve issues more effectively. Seeing the complete journey of a user action through the system, from initial request to final response, will provide insights that were previously difficult or impossible to obtain.

This work highlights an important truth about modern software engineering: sometimes the tools you need don't exist yet, and you have to build them yourself. Discord's custom tracing solution is a testament to their commitment to understanding and improving their systems, even when it means tackling complex technical challenges. The result will be better visibility into system performance, faster incident resolution, and ultimately, a better experience for Discord users around the world.

The implementation of distributed tracing in Discord's Elixir infrastructure represents more than just a technical achievement—it's a strategic investment in the platform's reliability and performance. As Discord continues to grow and evolve, having the ability to quickly identify and resolve performance bottlenecks will become increasingly critical. This custom-built tracing system provides the visibility needed to maintain the instantaneous, seamless experience that Discord users expect, even as the platform scales to handle ever-increasing user activity and complexity.

For engineering teams working with Elixir or similar languages that lack built-in support for distributed tracing, Discord's experience offers valuable lessons. Building custom tracing infrastructure is possible without downtime, but it requires careful consideration of the language's unique characteristics and communication patterns. The effort invested in creating these tools pays dividends in improved system observability and faster incident response times.

Discord's journey to implement distributed tracing in their Elixir-based guild system showcases the ongoing evolution of observability practices in modern distributed systems. As platforms grow more complex and user expectations for performance continue to rise, the ability to trace requests across service boundaries becomes increasingly essential. Discord's custom solution not only solves their immediate needs but also contributes to the broader conversation about how to achieve effective observability in Elixir and similar functional programming environments.

#Elixir #Distributed Tracing #Observability #Performance Monitoring #Discord
