Rate Limiting: Concepts, Algorithms, and Distributed Challenges
#Backend

A comprehensive guide to rate limiting algorithms, their trade-offs, and the complexities of implementing them in distributed systems.

If you've ever built an API or backend service, you've probably faced one of these problems: a single user sending too many requests, bots abusing your endpoints, traffic spikes breaking your service, or retries and cron jobs accidentally overloading your system. This post is about rate limiting, a simple but critical technique for protecting systems from these issues.

In this post, we will look at why rate limiting is needed, how common rate limiting algorithms work, and where rate limiting fits in real systems. You don't need prior knowledge of rate limiting: if you understand basic APIs and requests, you'll be able to follow along.

The Problem Rate Limiting Solves

When a request hits your server, it consumes resources such as CPU time, memory, database connections, and network bandwidth. Under normal usage this works fine, but problems start when too many requests arrive at the same time. This can happen because of:

  • a single user sending requests in a tight loop
  • bots hitting public endpoints
  • retry mechanisms without proper backoff
  • sudden traffic spikes after a release or promotion

The server treats every request the same - it doesn't know which request is legitimate and which one is harmful.

Why does this become a serious problem? If the request volume keeps increasing without limits: response times go up, databases start slowing down, timeouts increase, error rates spike, and eventually the service becomes unavailable for everyone.

Why can't we just "scale the server"? The common reaction is "let's just add more servers." Scaling helps, but it does not solve the root problem: unlimited requests will eventually overwhelm any system, scaling increases cost, and databases or third-party APIs may not scale the same way. Scaling alone only delays failure.

What systems really need is a way to control how fast requests are allowed in: protection against accidental or intentional abuse, and fairness so one user cannot starve others. This is exactly the problem rate limiting is designed to solve.

What Rate Limiting Actually Does

At its core, rate limiting controls how frequently an action is allowed within a given time period. Most commonly, this action is an API request.

A rate limit usually looks like this:

  • allow 100 requests per minute per user
  • allow 10 requests per second per IP
  • allow 1 request per second for a sensitive endpoint

When the limit is reached, the system does not process further requests until enough time has passed.

What happens when a limit is exceeded? When a client crosses the allowed limit, the request is rejected, the server responds immediately, and resources are preserved for other users. In HTTP-based systems, this rejection is commonly signaled with a 429 Too Many Requests response.

This early rejection is important. It prevents unnecessary work from happening in the system, such as database queries or external API calls.
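
As a rough sketch of what that looks like in code (the handle_request and is_allowed names below are hypothetical placeholders, not part of any specific framework):

```python
# Hypothetical sketch of early rejection. is_allowed() stands in for
# whatever rate limiting algorithm is in use; it is stubbed out here
# only to keep the example self-contained.

def is_allowed(user_id: str) -> bool:
    return True  # placeholder for a real rate limit check

def handle_request(user_id: str) -> dict:
    if not is_allowed(user_id):
        # Reject immediately, before touching the database or any
        # downstream service.
        return {"status": 429, "body": "Too Many Requests"}
    # Expensive work only runs for requests that passed the check.
    return {"status": 200, "body": "ok"}
```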

What does rate limiting guarantee? Rate limiting helps systems achieve a few important guarantees:

  • Fair usage: One user cannot consume resources meant for everyone else.
  • Predictable performance: The system remains responsive even under load.
  • Controlled bursts: Some algorithms allow short bursts while still enforcing long-term limits.
  • System protection: Accidental bugs or misbehaving clients are contained early.

What does rate limiting NOT do? Rate limiting is often misunderstood, so it's important to be clear about its limits. Rate limiting is not authentication (it does not verify who the user is), not a complete security solution, and not a replacement for proper input validation. It is a traffic control mechanism, not a security gate.

Common Rate Limiting Algorithms

Different systems use different rate limiting algorithms depending on their needs. There is no algorithm that is "best" - each one makes different trade-offs between simplicity, accuracy, and flexibility. Let's go through the most commonly used ones.

Fixed Window Counter

This is the simplest form of rate limiting.

How it works:

  • Time is divided into fixed windows, such as 1 minute or 1 hour.
  • For each user, the system keeps a counter for the current window.
  • Every incoming request increments this counter.
  • Once the counter reaches the limit, further requests are rejected.
  • When the window ends, the counter is reset to zero.

Example:

  • Limit: 5 requests per minute
  • Window: 12:00 - 12:01
  • A user sends 5 requests at 12:00:59 -> all allowed
  • The counter resets at 12:01:00 -> another 5 requests are allowed

This means the user effectively made 10 requests in 1 second.
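
A minimal single-process sketch of this counter, assuming an in-memory dictionary and the 5-requests-per-minute limit from the example (the allow() helper and constants are just illustrative):

```python
import time
from collections import defaultdict

# Fixed window counter sketch: 5 requests per 60-second window, per user.
LIMIT = 5
WINDOW_SECONDS = 60
counters = defaultdict(lambda: {"window": -1, "count": 0})

def allow(user_id: str) -> bool:
    window = int(time.time()) // WINDOW_SECONDS   # index of the current window
    entry = counters[user_id]
    if entry["window"] != window:
        entry["window"] = window                  # a new window has started:
        entry["count"] = 0                        # reset the counter
    if entry["count"] >= LIMIT:
        return False                              # limit reached, reject
    entry["count"] += 1
    return True
```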

Why Fixed Window fails: The main issue with Fixed Window Counter is that it only looks at the clock. It does not care how requests are distributed inside the window. Because of this, users can exploit window boundaries, traffic becomes very bursty, backend services can suddenly receive huge spikes, and the system becomes unfair under load.

When Fixed Window can be acceptable: very low-traffic systems, internal tools, prototypes or demos, cases where correctness is less important than simplicity. In most production APIs, Fixed Window is usually avoided.

Sliding Window

Sliding window algorithms try to fix the burst problem of fixed windows. Instead of looking at a fixed clock window, the system looks at the last N seconds from the current time.

How it works:

  • The system always looks at the last N seconds from the current time.
  • Every request is evaluated against this rolling window.
  • The system counts how many requests occurred in the previous N seconds.
  • If the count exceeds the limit, the request is rejected.

Example:

  • Limit: 100 requests per 60 seconds
  • At any moment, the system checks how many requests happened in the previous 60 seconds

This prevents sudden spikes caused by window resets, as requests are spread more evenly over time.
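
One common way to implement this is a sliding window log that stores a timestamp per request. The sketch below assumes a single process and illustrative constants:

```python
import time
from collections import defaultdict, deque

# Sliding window log sketch: 100 requests per rolling 60-second window.
LIMIT = 100
WINDOW_SECONDS = 60
request_log = defaultdict(deque)  # user_id -> timestamps of recent requests

def allow(user_id: str) -> bool:
    now = time.time()
    log = request_log[user_id]
    # Drop timestamps that have fallen out of the rolling window.
    while log and log[0] <= now - WINDOW_SECONDS:
        log.popleft()
    if len(log) >= LIMIT:
        return False        # still too many requests in the last 60 seconds
    log.append(now)
    return True
```

This also makes the cost visible: every user carries a log of recent timestamps, which is exactly the memory and computation overhead mentioned below.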

Sliding Window is better because: much fairer request distribution, traffic spikes are naturally reduced, no burst problems at window boundaries.

Sliding Window is harder because: the system needs to store timestamps of requests, memory usage increases with traffic, more computation is needed per request.

Token Bucket

Token Bucket is one of the most commonly used algorithms in production systems because it balances strict limits with good user experience.

How it works:

  • Each user has a bucket that holds tokens.
  • Tokens are added to the bucket at a fixed rate.
  • Each request consumes one token.
  • If there are no tokens left, the request is rejected.
  • The bucket has a maximum capacity, so tokens cannot grow infinitely.

Example:

  • Bucket size: 10 tokens
  • Refill rate: 1 token per second
  • Scenario: User is idle -> bucket fills up to 10 tokens
  • User sends 10 requests instantly -> all allowed
  • The 11th request is rejected
  • After 1 second, 1 token is added -> 1 request allowed
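
A minimal sketch of the bucket from the example above (single process, illustrative constants):

```python
import time

# Token bucket sketch: capacity 10 tokens, refilled at 1 token per second.
CAPACITY = 10.0
REFILL_RATE = 1.0  # tokens per second

class TokenBucket:
    def __init__(self) -> None:
        self.tokens = CAPACITY          # an idle user starts with a full bucket
        self.last_refill = time.time()

    def allow(self) -> bool:
        now = time.time()
        elapsed = now - self.last_refill
        # Refill for the time elapsed, capped at the bucket capacity.
        self.tokens = min(CAPACITY, self.tokens + elapsed * REFILL_RATE)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0          # spend one token for this request
            return True
        return False
```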

Token Bucket works well because: allows short bursts without breaking the system, enforces long-term rate limits, provides better user experience, simple and efficient to implement.

Because of these properties, Token Bucket is often the default choice for APIs.

Leaky Bucket

Leaky Bucket focuses on producing a smooth and stable output rate.

How it works:

  • Incoming requests are placed into a queue (the bucket).
  • Requests leave the queue at a constant, fixed rate.
  • If the queue becomes full, new requests are dropped.

Example:

  • Many requests arrive at once
  • The system processes requests at a steady pace
  • Extra requests are dropped when the queue is full
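
A rough single-process sketch of this queue-and-drain behaviour (the capacity and leak rate are just example values):

```python
import time
from collections import deque

# Leaky bucket sketch: requests queue up and leave at a constant rate.
QUEUE_CAPACITY = 10
LEAK_RATE = 2.0  # requests processed per second

queue = deque()
last_leak = time.time()

def leak() -> None:
    """Drain as many requests as the constant rate allows."""
    global last_leak
    now = time.time()
    to_process = int((now - last_leak) * LEAK_RATE)
    if to_process > 0:
        for _ in range(min(to_process, len(queue))):
            queue.popleft()            # in a real system, process the request here
        last_leak += to_process / LEAK_RATE

def enqueue(request) -> bool:
    leak()                             # drain first, then try to enqueue
    if len(queue) >= QUEUE_CAPACITY:
        return False                   # bucket full: drop the request
    queue.append(request)
    return True
```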

Leaky Bucket is useful because: protects downstream systems very well, ensures predictable processing speed, prevents sudden traffic spikes.

But Leaky Bucket is not ideal for APIs: burst requests are delayed or dropped, latency increases under load, and user experience can suffer. Leaky Bucket is more suitable for background jobs and pipelines than for user-facing APIs.

Comparing the Algorithms

The goal here is not to dive deep again, but to understand which algorithm fits which situation.

High-level Comparison:

  Algorithm        Burst Handling   Fairness   Complexity   Common Usage
  Fixed Window     Poor             Low        Very Low     Simple or low-traffic systems
  Sliding Window   Good             High       High         Systems needing accuracy
  Token Bucket     Very Good        High       Medium       Most APIs
  Leaky Bucket     Poor             Medium     Medium       Background jobs

How to think about the choice:

  • Fixed Window is easy to build, but breaks under bursty traffic. It's fine for simple or internal use cases.
  • Sliding Window gives fair and accurate limits, but costs more in memory and computation.
  • Token Bucket is the most practical choice for APIs. It allows small bursts while still enforcing long-term limits.
  • Leaky Bucket focuses on smooth traffic rather than fast responses. It works better for background processing than user-facing APIs.

Challenges in Distributed Systems

So far, everything we discussed assumes a single server. In real-world applications, this is rarely the case. Most systems run on multiple servers behind a load balancer. This introduces a few important challenges for rate limiting.

Shared State Problem: If each server keeps its own rate limit counters:

  • one user can hit different servers,
  • each server thinks the user is under the limit,
  • the user ends up sending more requests than allowed.

This breaks the rate limit completely.
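
The usual fix is to keep the counters in a store that all servers share. As one illustration, here is a sketch of a fixed window counter backed by Redis, assuming a reachable Redis instance and the redis-py client (the key naming is just an example):

```python
import time
import redis  # assumes the redis-py client and a reachable Redis server

r = redis.Redis(host="localhost", port=6379)
LIMIT = 100
WINDOW_SECONDS = 60

def allow(user_id: str) -> bool:
    window = int(time.time()) // WINDOW_SECONDS
    key = f"ratelimit:{user_id}:{window}"
    count = r.incr(key)                    # atomic increment shared by every server
    if count == 1:
        r.expire(key, WINDOW_SECONDS * 2)  # old windows clean themselves up
    return count <= LIMIT
```

Because every server talks to the same counter, one extra network round trip per request is the price paid for a limit that actually holds across the whole fleet.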

Consistency vs Performance: To fix this, servers need to share state. But shared state brings trade-offs:

  • strong consistency is accurate but slow,
  • weak consistency is fast but slightly inaccurate.

Most real systems accept small inaccuracies to gain performance.

Time Synchronization: Many rate limiting algorithms depend on time. In distributed systems:

  • server clocks may not be perfectly aligned,
  • small time differences can affect limits.

This needs to be handled carefully.

Key takeaway: Rate limiting becomes much harder once a system is distributed. You must: share state, accept trade-offs, balance accuracy and performance.

At a high level, this blog covered how rate limiting works, why different algorithms exist, and why things become more complex once systems are distributed. While exploring these ideas, I also implemented a complete distributed rate limiting system that applies the concepts discussed here. If you're interested in seeing a real-world implementation, you can find it here: github
