Netflix shares sophisticated techniques for managing their global service fleet, focusing on buffer management, capacity planning, and traffic shaping to maintain both efficiency and reliability across diverse workloads.
In today's cloud-native landscape, organizations face the fundamental challenge of balancing efficiency with reliability. At QCon San Francisco, engineers from Netflix shared their approach to managing a global fleet of services that serves millions of users worldwide. Their presentation revealed the sophisticated strategies Netflix employs to navigate the inherent tension between cost optimization and service availability.
The Efficiency-Reliability Dilemma
Joseph Lynch and Argha C from Netflix opened by highlighting the universal struggle faced by engineering teams: being asked simultaneously to save money on services and ensure they work flawlessly. "How many of you have been asked to save money on your services? How many have been asked to make it work all the time?" Lynch asked the audience. "How many have been a little frustrated by how that's like an inherent tension?"
Netflix's scale amplifies this challenge. As a global business serving customers on diverse platforms, from mobile devices to PCs to TVs, Netflix must maintain a fully active deployment across four AWS regions for control-plane activities while operating Open Connect, its world-scale CDN for content delivery. This infrastructure must handle everything from critical playback services to best-effort background processes.
Rethinking Efficiency: Beyond CPU Utilization
A key insight from Netflix's approach is redefining efficiency beyond simple CPU utilization. Instead, they focus on "risk-adjusted net value," which considers:
- Service value to the business
- Operational costs
- Potential failure costs
"Failures also have cost," Lynch explained. "When a service fails, that has a cost to your business." He illustrated this with a striking statistic: Amazon loses approximately $20,000 per second of downtime. At Netflix's scale, even brief service interruptions can translate to significant business impact.
Netflix categorizes services into three tiers based on their loss function:
- Tier 0: No fallback (e.g., playback services—without these, Netflix doesn't work)
- Tier 1: Degraded service available (e.g., personalized thumbnails with fallback options)
- Tier 2: Best-effort services (minimal business impact if unavailable)
The company's efficiency model allocates resources accordingly, prioritizing capacity for critical services while optimizing costs for less critical ones.
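The "risk-adjusted net value" framing can be written down as a one-line model. The function name and the numbers below are illustrative assumptions, not Netflix's actual formula:

```python
def risk_adjusted_net_value(business_value: float,
                            operational_cost: float,
                            failure_probability: float,
                            failure_cost: float) -> float:
    """Net value of running a service: its business value minus operating
    cost minus the *expected* cost of failure (probability times impact)."""
    expected_failure_cost = failure_probability * failure_cost
    return business_value - operational_cost - expected_failure_cost

# Adding capacity raises operational cost but lowers failure probability;
# the model lets the two configurations be compared directly (toy numbers):
lean = risk_adjusted_net_value(1_000_000, 50_000, 0.02, 5_000_000)    # 850,000
padded = risk_adjusted_net_value(1_000_000, 80_000, 0.001, 5_000_000)  # 915,000
```

In this toy example the extra spend on capacity pays for itself, because the expected failure cost drops by more than the operational cost rises; for a Tier 2 service with a small failure cost, the same arithmetic would favor the lean configuration.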
Understanding Hardware Supply
Netflix's approach to hardware supply centers on the concept of "buffer"—the ratio of offered load that a service can accept successfully. "We define buffer as the ratio over offered load that a service can accept successfully," Lynch stated. "For example, in this case, you can double the traffic on the service and it will be ok. It will respond successfully."
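Read literally, the definition is a simple ratio. A minimal sketch:

```python
def buffer(max_successful_load: float, offered_load: float) -> float:
    """Buffer as Lynch defines it: the ratio of load the service could
    accept successfully to the load currently offered. A buffer of 2.0
    means traffic could double and the service would still respond."""
    return max_successful_load / offered_load

# A service offered 10,000 RPS that could successfully serve 20,000 RPS
# has a buffer of 2.0, matching the "double the traffic" example.
```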
Buffer management involves several key considerations:
- Service criticality and business impact
- Recovery speed when entering buffer zones
- Hardware characteristics (newer generations offer better performance but carry supply risks)
Netflix employs a strategic mix of reserved and on-demand instances to balance cost and availability. Reserved instances provide guaranteed capacity for critical, slow-scaling services, while on-demand instances offer flexibility for variable workloads.
The company has also developed sophisticated models to optimize hardware selection based on workload characteristics, pricing, and capacity availability. "We've spent a lot of time, we've open-sourced a library where we provide an apples-to-apples comparison that allows us to say if we take a workload that's running on this computer and we run it on that one, how do we expect it to look?" Lynch explained.
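The apples-to-apples comparison Lynch describes can be approximated with a toy cost model: normalize each instance type's throughput against a baseline, then pick the cheapest configuration that meets the requirement. The dataclass fields, generation names, and prices below are invented for illustration and are not the open-sourced library's API:

```python
from dataclasses import dataclass

@dataclass
class InstanceType:
    name: str
    relative_perf: float  # workload throughput per instance, vs. a baseline
    hourly_price: float   # cost per instance-hour

def cheapest_fit(required_throughput: float, candidates: list[InstanceType]):
    """Return (name, instance_count, hourly_cost) of the cheapest candidate
    that meets the throughput requirement."""
    best = None
    for inst in candidates:
        count = -(-required_throughput // inst.relative_perf)  # ceiling division
        cost = count * inst.hourly_price
        if best is None or cost < best[2]:
            best = (inst.name, int(count), cost)
    return best

# A newer generation can win on price-performance despite a higher unit price:
fleet = [InstanceType("gen6", 1.0, 0.40), InstanceType("gen7", 1.4, 0.50)]
# cheapest_fit(100.0, fleet) compares 100 gen6 instances vs. 72 gen7 instances
```

This ignores the supply-risk dimension the talk raises: the model would also need to weight newer generations by the chance that capacity is simply unavailable when needed.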

Understanding Software Demand
On the demand side, Netflix focuses on understanding workload profiles and scaling behaviors. "The first key component to understanding demand is you need to understand how individual workloads behave, which means that you need to understand at a workload level their profiles," Argha C explained.
These profiles include:
- CPU utilization patterns
- Memory requirements (including allocation rates for JVM applications)
- Network utilization (particularly important for stateful services)
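A workload profile like the one described is useful precisely because different services saturate different dimensions. A sketch of how a profile might identify the binding resource on a given instance shape (all field names and numbers are illustrative):

```python
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    cpu_cores: float     # steady-state CPU demand
    memory_gib: float    # resident memory, including JVM heap headroom
    network_mbps: float  # throughput, often dominant for stateful services

@dataclass
class InstanceShape:
    cpu_cores: float
    memory_gib: float
    network_mbps: float

def dominant_resource(w: WorkloadProfile, box: InstanceShape) -> str:
    """The dimension that saturates first determines how the workload packs
    onto hardware, and which metric autoscaling should track."""
    ratios = {
        "cpu": w.cpu_cores / box.cpu_cores,
        "memory": w.memory_gib / box.memory_gib,
        "network": w.network_mbps / box.network_mbps,
    }
    return max(ratios, key=ratios.get)

# A cache-like stateful service may be network-bound long before CPU-bound:
cache = WorkloadProfile(cpu_cores=2, memory_gib=24, network_mbps=4000)
box = InstanceShape(cpu_cores=16, memory_gib=64, network_mbps=5000)
# dominant_resource(cache, box) -> "network"
```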
Netflix also analyzes how workloads scale, observing production behavior to derive scaling targets. "To do that, the simplest technique is observe the workload in production," C noted. "Look at what load it serves. Typically for us, it can be quite predictable, the load."
A critical consideration is startup time. "You need to factor in startup times, because it's not just enough to say that I need compute, bang, schedule compute, and it's there," C emphasized. "Slow starting services need more buffer to accommodate this delay."
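One way to make the startup-time point concrete: the buffer has to absorb whatever load growth occurs while new instances are still booting. A toy model (the formula and numbers are illustrative assumptions, not Netflix's):

```python
def required_buffer(growth_rate_per_min: float, startup_minutes: float,
                    base_buffer: float = 1.1) -> float:
    """Headroom needed on top of a base buffer, given that load keeps
    growing for the full startup window before new capacity helps."""
    return base_buffer * (1 + growth_rate_per_min * startup_minutes)

# If load can grow 5% per minute, a service that boots in 8 minutes needs
# roughly 1.54x headroom, while a 1-minute starter needs only about 1.16x.
```

The asymmetry is the point: shaving startup time directly reduces how much idle capacity a service must carry.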
Balancing Supply and Demand
With understanding of both supply and demand, Netflix employs several strategies to maintain balance:
Pre-scaling
The simplest approach is pre-scaling the fleet based on predicted demand. "We have reasonably good estimates of what load looks like," C explained. "We try and estimate how much viewership we are going to get, like how many viewers are going to watch this popular event or title we're doing."
Effective pre-scaling maintains appropriate buffers for different service tiers without over-provisioning. "In practice, an efficient pre-scale, this is a real service, what it looks like is you can see that we've scaled up the service, the traffic comes in, and that blue line is very interesting," C illustrated, showing minimal transient load shedding during scaling events.
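Combining the demand estimate with the startup-time constraint, a pre-scaling schedule can be sketched as shifting the desired-capacity curve earlier by the boot time, so instances are ready before the load arrives. A simplified sketch, not Netflix's scheduler:

```python
import math

def prescale_schedule(predicted_rps: list[float], per_instance_rps: float,
                      buffer: float, startup_min: int) -> list[int]:
    """Per-minute desired instance counts for a predicted load curve,
    pulled forward by the startup time."""
    desired = [math.ceil(r * buffer / per_instance_rps) for r in predicted_rps]
    # At minute t, provision for the peak need up to minute t + startup_min.
    return [max(desired[t:t + startup_min + 1]) for t in range(len(desired))]

# Load ramping into a popular event; with 1-minute startup, capacity for the
# spike at minute 3 is already in place by minute 2:
# prescale_schedule([100, 100, 500, 1000], 10, 1.5, 1) -> [15, 75, 150, 150]
```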
Dynamic Traffic Shaping
Netflix employs sophisticated traffic management to distribute demand across their global infrastructure. "We have to account for things that we cannot control," C noted, citing unpredictable content popularity spikes during new releases.
Their approach includes:
- Redistributing existing traffic across regions to balance load
- Steering new traffic based on regional capacity
- Using authoritative resolvers and DNS tiering to control traffic flow
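At its simplest, the redistribution step amounts to spreading total demand across regions in proportion to each region's capacity. This sketch ignores the latency, DNS-TTL, and failover constraints real steering must respect; region names and numbers are illustrative:

```python
def steer(traffic_rps: dict[str, float],
          capacity_rps: dict[str, float]) -> dict[str, float]:
    """Redistribute total traffic across regions in proportion to capacity,
    so no single region is disproportionately loaded."""
    total_traffic = sum(traffic_rps.values())
    total_capacity = sum(capacity_rps.values())
    return {region: total_traffic * cap / total_capacity
            for region, cap in capacity_rps.items()}

# A popularity spike concentrated in one region gets spread evenly:
before = {"us-east": 9000, "us-west": 1000, "eu-west": 2000}
caps = {"us-east": 6000, "us-west": 6000, "eu-west": 6000}
# steer(before, caps) -> 4000 RPS to each region
```

Balanced regional traffic is what lets provisioning target the average rather than the worst-case regional peak, which is the provisioning implication C points to.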
"When we did apply shaping, you see how more balanced this traffic looks like," C showed, comparing unbalanced regional traffic to their optimized distribution. "This has significant implications for how we provision compute."
Reactive Autoscaling
When proactive measures aren't sufficient, Netflix relies on reactive autoscaling with carefully calibrated thresholds. "We have to determine three things," C explained. "The first thing is what we call target tracking. It's a number that basically measures for a given service on a given hardware at what number, at what CPU percentage should you start scaling out."
Their system includes "hammers"—emergency capacity injections when approaching failure buffers. "When you're pushing in right end of that success buffer, which means if you went further, you would have to load shed, you also need emergency injection of capacity, and that's called hammers," C detailed.
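Target tracking and hammers can be combined in a single scaling decision: proportional scaling pulls CPU back toward the target, and crossing the hammer threshold near the edge of the success buffer injects a large block of capacity at once. The thresholds and step size below are illustrative assumptions, not Netflix's tuned values:

```python
import math

def autoscale_decision(cpu_pct: float, target_pct: float, hammer_pct: float,
                       current_instances: int, hammer_step: int = 50) -> int:
    """Return the new desired instance count for one scaling evaluation."""
    if cpu_pct >= hammer_pct:
        # Emergency capacity injection ("hammer") near the failure buffer.
        return current_instances + hammer_step
    if cpu_pct > target_pct:
        # Target tracking: scale out so CPU should return to the target.
        return math.ceil(current_instances * cpu_pct / target_pct)
    return current_instances

# At 60% CPU against a 50% target, 100 instances grow to 120;
# at 90% CPU the hammer fires and 50 instances are added immediately.
```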
Load Shedding: Protecting Critical Services
When even reactive measures can't keep up, Netflix employs priority-based load shedding. "Load shedding is two types. It's like you can shed load in bulk, which is undiscriminated shedding," C explained. "However, what we want the shedding to look like in practice is that we would want to start with the least critical traffic, which is what we call bulk, and progressively move to best effort degraded. The last traffic you want to shed is critical traffic."
This approach ensures that critical services remain available even under extreme load. "In production, what it looks like is your non-critical shedding starts much earlier, and way before you have to even touch critical shedding," C demonstrated. "While your errors go up in between, fundamentally, you've not dropped RPS success."
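The progressive shedding C describes amounts to admitting traffic most-critical-first, so the least critical tiers absorb all of the loss until capacity runs out. The tier names follow the talk; everything else is a simplified sketch:

```python
# Least critical first; critical traffic is the last to ever be shed.
SHED_ORDER = ["bulk", "best_effort", "degraded", "critical"]

def shed(offered_rps: dict[str, float], capacity_rps: float) -> dict[str, float]:
    """Admit traffic most-critical-first; whatever no longer fits within
    capacity is dropped, starting with bulk."""
    admitted = {}
    remaining = capacity_rps
    for tier in reversed(SHED_ORDER):  # critical is admitted first
        take = min(offered_rps.get(tier, 0), remaining)
        admitted[tier] = take
        remaining -= take
    return admitted

# Offered 1,250 RPS against 900 RPS of capacity: all bulk and a third of
# best-effort are shed, while degraded and critical traffic are untouched.
offered = {"critical": 600, "degraded": 200, "best_effort": 150, "bulk": 300}
```

This reproduces the production behavior quoted above: non-critical shedding starts well before critical traffic is touched, so successful critical RPS holds steady even as overall errors rise.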

Lessons and Best Practices
Netflix's experience offers several valuable lessons for organizations managing large-scale cloud infrastructure:
Invest in both proactive and reactive measures: "Please invest both in proactive and reactive levers," Lynch emphasized. "One is not sufficient."
Adopt a systems thinking approach: "Everything that we are talking about needs holistic solutions," C noted. "You have to have systems thinking. You can't just buy raw compute to buy down risk."
Maintain end-to-end traffic control: "You need end-to-end control over your traffic," Lynch stated. "A lot of what we are talking about is not possible unless you do that."
Leverage math and economics: "Math, for both of us, and for Netflix, is our safety blanket," Lynch concluded. "When you combine math with some Econ 101, you can solve hard problems and make a lot of progress."
Netflix's approach demonstrates that efficiency and reliability aren't opposing forces but complementary objectives that can be balanced through careful design, mathematical modeling, and comprehensive system thinking. Their global fleet architecture serves as a valuable reference for organizations navigating the complexities of large-scale cloud infrastructure.
For those interested in deeper technical details, Netflix has open-sourced several tools and libraries related to their capacity planning and optimization efforts. The company's presentations at events like QCon San Francisco continue to provide valuable insights into cutting-edge cloud architecture practices.
