Both Uber and OpenAI have moved from counter-based, per-service rate limits to adaptive, policy-based systems implemented at the infrastructure layer, replacing hard stops with soft controls that maintain system resilience while preserving user momentum.
In recent blog posts, both Uber and OpenAI have detailed significant architectural shifts in their approach to rate limiting, moving from counter-based, per-service limits to adaptive, policy-based systems implemented at the infrastructure layer. These systems rely on soft controls - probabilistic shedding in Uber's case, a credit-based waterfall in OpenAI's - that manage traffic by exerting pressure on clients rather than imposing hard stops, preserving system resilience without sacrificing user momentum.
Previously, Uber engineers implemented rate limits per service, commonly using token buckets backed by Redis. This caused operational inefficiencies, such as additional latency and the need for deployments just to adjust thresholds. Inconsistent configurations increased maintenance risk and resulted in uneven protection, leaving some smaller services without any limits. Additionally, observability was fragmented, making it difficult to pinpoint problems caused specifically by rate limiting.
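For context, the pattern being replaced looks roughly like the sketch below: a per-service token bucket that either admits a request or rejects it outright. This is an illustrative in-process version, not Uber's code; their legacy limiters kept the bucket state in Redis, which is where the extra latency and operational overhead came from.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

// tokenBucket is a minimal in-process sketch of the per-service limiter
// pattern described above. Uber's legacy limiters kept this state in Redis
// instead, adding a network round trip to every admission decision.
type tokenBucket struct {
	mu         sync.Mutex
	capacity   float64   // maximum burst size
	refillRate float64   // tokens added per second
	tokens     float64   // currently available tokens
	lastRefill time.Time // last time tokens were replenished
}

func newTokenBucket(capacity, refillRate float64) *tokenBucket {
	return &tokenBucket{capacity: capacity, refillRate: refillRate, tokens: capacity, lastRefill: time.Now()}
}

// allow admits a request by consuming one token, or rejects it outright when
// the bucket is empty - the hard stop that GRL's soft limits later replace.
func (b *tokenBucket) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()

	now := time.Now()
	elapsed := now.Sub(b.lastRefill).Seconds()
	b.tokens = math.Min(b.capacity, b.tokens+elapsed*b.refillRate)
	b.lastRefill = now

	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	bucket := newTokenBucket(5, 1) // burst of 5 requests, refilling 1 token per second
	for i := 0; i < 7; i++ {
		fmt.Printf("request %d allowed: %v\n", i+1, bucket.allow())
	}
}
```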
Uber replaced these legacy limiters with a new Global Rate Limiter (GRL). The GRL architecture consists of a three-tier feedback loop: rate-limit clients in Uber's service mesh data plane enforce decisions locally, zone aggregators collect metrics, and regional controllers calculate global limits to push back to the clients. GRL also replaced hard-stop buckets with a policy that drops a configurable percentage of traffic (e.g., 10%) once a limit is exceeded. This acts as a soft limit that exerts pressure on caller services, allowing them to remain operational rather than being cut off entirely when their quotas are exhausted.
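The shedding idea can be illustrated with a small sketch: once the feedback loop marks a caller as over its limit, only a configured fraction of its requests is rejected. The 10% ratio mirrors the example in the article; the flag and field names below are assumptions for illustration, not GRL's actual implementation.

```go
package main

import (
	"fmt"
	"math/rand"
)

// shedder sketches the soft-limit behavior described above: instead of
// rejecting all traffic once a quota is exhausted, it drops a configurable
// fraction of requests, keeping the caller partially served while the
// regional controller adjusts the global limit.
type shedder struct {
	overLimit bool    // set by the feedback loop when the global limit is exceeded
	dropRatio float64 // e.g. 0.10 to shed 10% of traffic
}

func (s *shedder) allow() bool {
	if !s.overLimit {
		return true
	}
	// Probabilistic shedding: reject dropRatio of requests uniformly at random.
	return rand.Float64() >= s.dropRatio
}

func main() {
	s := &shedder{overLimit: true, dropRatio: 0.10}
	allowed := 0
	for i := 0; i < 10000; i++ {
		if s.allow() {
			allowed++
		}
	}
	fmt.Printf("allowed %d of 10000 requests (~90%% expected)\n", allowed)
}
```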
OpenAI implemented its new rate limiter with a similar architecture; however, the primary driver was the user experience of the Codex and Sora applications rather than operational resiliency. With growing adoption, OpenAI saw a consistent pattern: users found significant value in the tools, only to be interrupted by rate limits. While these boundaries ensured fair access and system stability, they frequently frustrated engaged users. OpenAI sought a way to maintain that momentum without discouraging exploration by pushing users straight into usage-based billing.
The engineering team designed a combined approach that allows users to access the system up to a limit, after which the system deducts from a credit balance. The team describes this decision-making process as a "waterfall":
This model reflects how users actually experience the product. Rate limits, free tiers, credits, promotions, and enterprise entitlements are all just layers in the same decision stack. From a user's perspective, they don't "switch systems" - they just keep using Codex and Sora. That's why credits feel invisible: they're just another element in the waterfall.
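A rough sketch of how such a waterfall evaluation might look is shown below. The account fields, tier ordering, and per-request cost are assumptions for illustration, not OpenAI's actual data model; the point is that the same decision path first consumes included rate-limit capacity and only then falls through to credits.

```go
package main

import "fmt"

// account holds the illustrative per-user state the waterfall consults.
// The field names and tiers are assumptions for this sketch, not OpenAI's schema.
type account struct {
	usedInWindow int     // requests already consumed in the current rate-limit window
	windowLimit  int     // included capacity for that window
	credits      float64 // prepaid credit balance
}

// evaluate walks the waterfall: included rate-limit capacity first,
// then the credit balance, and only then a hard denial.
func evaluate(a *account, cost float64) string {
	if a.usedInWindow < a.windowLimit {
		a.usedInWindow++
		return "granted: within rate limit"
	}
	if a.credits >= cost {
		a.credits -= cost
		return "granted: debited from credits"
	}
	return "denied: no capacity or credits left"
}

func main() {
	a := &account{windowLimit: 2, credits: 1.0}
	for i := 0; i < 4; i++ {
		fmt.Println(evaluate(a, 0.6)) // each request costs 0.6 credits once the limit is hit
	}
}
```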
To ensure this transition is seamless, OpenAI built a dedicated real-time access engine that consolidates usage tracking, rate-limit windows, and credit balances into a single evaluation path. Unlike traditional asynchronous billing systems that suffer from lag, this engine makes a provably correct decision synchronously: every request first checks the available capacity in its rate-limit tier and, only if that limit is exceeded, immediately falls back to the credit balance. To maintain low latency, the system settles credit debits asynchronously through a streaming processor, using stable idempotency keys to prevent double-charging. This architecture relies on three tightly coupled data streams - product usage events, monetization events, and balance updates - ensuring every transaction is auditable and reconcilable without interrupting the user's creative flow.
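The double-charging protection can be illustrated with a small sketch: if the settlement consumer keys every debit on a stable idempotency key, redelivered or duplicated events from the stream become no-ops. The event and ledger types below are assumptions for illustration, not OpenAI's actual streaming schema.

```go
package main

import (
	"fmt"
	"sync"
)

// debitEvent is an illustrative usage event emitted on the request path and
// settled later by a streaming consumer; the field names are assumptions.
type debitEvent struct {
	IdempotencyKey string // stable per-request key, e.g. the request ID
	Amount         float64
}

// ledger sketches the settlement side: applying the same event twice is a
// no-op because its idempotency key has already been recorded.
type ledger struct {
	mu      sync.Mutex
	applied map[string]bool
	balance float64
}

func newLedger(balance float64) *ledger {
	return &ledger{applied: make(map[string]bool), balance: balance}
}

// settle debits the balance exactly once per idempotency key, so replayed
// deliveries from the stream cannot double-charge the user.
func (l *ledger) settle(e debitEvent) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.applied[e.IdempotencyKey] {
		return // duplicate delivery: already settled
	}
	l.applied[e.IdempotencyKey] = true
	l.balance -= e.Amount
}

func main() {
	l := newLedger(10)
	e := debitEvent{IdempotencyKey: "req-123", Amount: 2.5}
	l.settle(e)
	l.settle(e)            // redelivered event is ignored
	fmt.Println(l.balance) // 7.5, charged exactly once
}
```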
Both Uber and OpenAI report that these architectural shifts have successfully met their respective operational and product goals. At Uber, the implementation of the Global Rate Limiter has scaled to process over 80 million requests per second across 1,100 services, significantly reducing tail latency by removing external Redis dependencies. The system has demonstrated its effectiveness in production by absorbing a 15x traffic surge without degradation and mitigating DDoS attacks before they reached internal systems.
Similarly, OpenAI has integrated its credit system into the access path for Codex and Sora, replacing hard stops with a continuous waterfall model. The platform provides real-time, accurate billing while maintaining the low-latency performance required for interactive AI applications.
For both companies, the move to in-house, infrastructure-level platforms has replaced manual configuration with automated, adaptive controls, allowing their respective fleets to handle massive scale with minimal human intervention.

This architectural evolution represents a broader trend in distributed systems design, where companies are moving away from rigid, per-service controls toward more sophisticated, adaptive systems that can respond to changing conditions while maintaining both operational resilience and user experience. The shift from hard limits to soft controls - whether through probabilistic shedding or credit-based waterfalls - demonstrates how infrastructure decisions can directly impact product experience and business outcomes.
The success of these systems also highlights the value of building rate limiting into the infrastructure layer rather than treating it as an application concern. By implementing these controls below individual services - in the service mesh data plane at Uber and in a dedicated real-time access engine at OpenAI - both companies have created systems that can scale automatically and respond to changing conditions without requiring manual intervention or service-specific configuration.
As distributed systems continue to grow in complexity and scale, the approaches taken by Uber and OpenAI offer valuable lessons for other organizations facing similar challenges. The key insight is that effective rate limiting is not just about preventing system overload - it's about creating systems that can gracefully handle traffic variations while maintaining both operational stability and user satisfaction.
