DoorDash Transforms Experimentation with Multi-Armed Bandits, Moving Beyond Traditional A/B Testing
#Machine Learning

Serverless Reporter

Traditional A/B testing at scale creates significant opportunity costs and slows iteration cycles. DoorDash engineers Caixia Huang and Alex Weinstein detail how they implemented multi-armed bandits (MAB) using Thompson sampling to dynamically allocate traffic, reducing waste while accelerating learning. The approach balances exploration and exploitation but introduces new challenges in metric analysis and user experience consistency.

Traditional A/B testing at scale creates a fundamental tension: organizations must run experiments long enough to achieve statistical significance, yet every moment spent serving a suboptimal variant to users represents lost revenue and degraded experience. DoorDash engineers Caixia Huang and Alex Weinstein identified this as a critical bottleneck in their experimentation platform. Their solution—adopting multi-armed bandits (MAB)—represents a shift from static, predetermined traffic splits to dynamic, adaptive allocation that learns and optimizes in real time.

The Cost of Static Experiments

In conventional A/B testing, traffic splits remain fixed throughout an experiment. Even when one variant clearly outperforms others early on, the experiment continues until reaching a predetermined stopping condition—typically a sample size calculated for statistical power. This creates what economists call "regret": the cumulative loss from serving inferior options during the experiment.

The problem compounds with concurrent experiments. When dozens of experiments run simultaneously, each consuming a portion of user traffic, the aggregate opportunity cost becomes substantial. Teams often respond by running experiments sequentially rather than in parallel, which reduces regret but dramatically slows iteration velocity, a critical factor in competitive advantage.

DoorDash's MAB approach fundamentally rethinks this model. Instead of treating experiments as static comparisons, it frames them as continuous learning problems where the system actively seeks to maximize reward while minimizing regret.

How Multi-Armed Bandits Work

The multi-armed bandit analogy originates from probability theory: a gambler faces multiple slot machines ("one-armed bandits") with unknown payout rates and must decide which to play, how often, and when to try alternatives. Each machine represents an experimental variant, and each pull represents a user interaction.

The core challenge is balancing two competing objectives:

  1. Exploration: Learning about all available options by trying each sufficiently
  2. Exploitation: Prioritizing the best-performing options to maximize immediate reward

Too much exploration wastes traffic on inferior variants. Too much exploitation risks missing better options that haven't been tested enough. MAB algorithms dynamically adjust this balance based on accumulating evidence.

DoorDash implemented Thompson sampling, a Bayesian approach that has proven particularly effective for their use case. Thompson sampling works by maintaining probability distributions over each variant's expected reward. At each decision point—typically every few seconds or minutes—the algorithm samples from these distributions to decide which variant to serve. As new feedback arrives, it updates the distributions, making the system progressively more confident about each option's true performance.
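To make the mechanics concrete, here is a minimal Thompson-sampling sketch for binary rewards (such as click or conversion events), assuming Beta posteriors over each variant's conversion rate. The variant names, simulated conversion rates, and uniform Beta(1, 1) priors are illustrative, not DoorDash's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-variant posteriors: Beta(successes + 1, failures + 1).
posteriors = {
    "control":   {"successes": 0, "failures": 0},
    "variant_a": {"successes": 0, "failures": 0},
    "variant_b": {"successes": 0, "failures": 0},
}

def choose_variant():
    """Sample a plausible reward rate from each posterior; serve the max."""
    samples = {
        name: rng.beta(p["successes"] + 1, p["failures"] + 1)
        for name, p in posteriors.items()
    }
    return max(samples, key=samples.get)

def record_feedback(name, converted):
    """Update the chosen variant's posterior with an observed outcome."""
    if converted:
        posteriors[name]["successes"] += 1
    else:
        posteriors[name]["failures"] += 1

# Simulated loop: true (unknown) conversion rates drive the feedback.
true_rates = {"control": 0.10, "variant_a": 0.12, "variant_b": 0.08}
for _ in range(10_000):
    chosen = choose_variant()
    record_feedback(chosen, rng.random() < true_rates[chosen])

# Pull counts per variant; traffic should concentrate on the best performer.
print({name: p["successes"] + p["failures"] for name, p in posteriors.items()})
```

In this toy simulation, most pulls should concentrate on variant_a within a few thousand interactions, while the weaker variants continue to receive a small exploratory share because their posteriors remain wider.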

This Bayesian foundation provides several advantages. First, it naturally handles uncertainty: variants with limited data have wider probability distributions, encouraging more exploration. Second, it's robust to delayed feedback, which is common in e-commerce where conversion events may occur hours after initial interaction. Third, it provides a principled way to incorporate prior knowledge about expected performance.

Implementation Architecture

While the article doesn't detail DoorDash's specific infrastructure, typical MAB implementations require several components:

Decision Engine: A low-latency service that receives context (user ID, experiment name, available variants) and returns the selected variant. This must be fast—sub-10ms latency is typical—to avoid impacting user experience.

Feedback Pipeline: A streaming system that captures user interactions and outcomes. In DoorDash's case, this likely includes clicks, conversions, order values, and other business metrics. The pipeline must handle high throughput and provide timely updates to the decision engine.

Model Storage: Thompson sampling requires maintaining probability distributions for each variant. These distributions are typically Gaussian for continuous rewards or Beta distributions for binary outcomes. The storage system must support frequent updates and concurrent access.
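As a sketch of what that storage might hold, the snippet below models the two common cases with conjugate updates: a Beta posterior for binary outcomes and a Normal posterior (assuming known observation noise) for continuous rewards. The field names and priors are assumptions for illustration, not DoorDash's schema.

```python
from dataclasses import dataclass

@dataclass
class BetaPosterior:
    """Binary rewards (e.g., converted / did not convert)."""
    alpha: float = 1.0  # prior "successes" + 1
    beta: float = 1.0   # prior "failures" + 1

    def update(self, converted: bool) -> None:
        if converted:
            self.alpha += 1.0
        else:
            self.beta += 1.0

@dataclass
class GaussianPosterior:
    """Continuous rewards (e.g., order value), assuming known observation noise."""
    mean: float = 0.0
    precision: float = 1e-3     # 1 / prior variance (deliberately weak prior)
    obs_precision: float = 1.0  # 1 / assumed observation noise variance

    def update(self, reward: float) -> None:
        # Standard conjugate update for a Normal mean with known variance.
        new_precision = self.precision + self.obs_precision
        self.mean = (self.precision * self.mean
                     + self.obs_precision * reward) / new_precision
        self.precision = new_precision
```

Each posterior is only a handful of floats, so in practice they might be keyed by (experiment, variant) in a low-latency store such as Redis, though that choice is an assumption here rather than a detail from the presentation.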

Experiment Configuration: A service that defines experiments, including which variants to test, reward functions, and stopping conditions. This is where teams specify what constitutes "success" for their experiment.

Monitoring and Observability: Comprehensive dashboards showing allocation percentages, reward rates, and statistical confidence intervals. Unlike traditional A/B testing where results emerge at experiment conclusion, MAB requires continuous monitoring to detect issues like algorithmic bias or unexpected behavior.

Trade-offs and Challenges

Adopting MAB introduces significant trade-offs that DoorDash engineers explicitly acknowledge.

Metric Analysis Limitations

Traditional A/B testing provides a clean experimental design: fixed traffic splits, predetermined sample sizes, and post-experiment analysis of any metric. Once the experiment concludes, teams can analyze secondary metrics, segment by user characteristics, or perform exploratory analysis without concern for multiple comparison problems.

MAB breaks this model. Because traffic allocation changes dynamically based on the reward function, variant assignment is no longer independent of observed outcomes, so analyzing metrics not included in the reward function becomes statistically problematic. If the algorithm allocates more traffic to variant A because it performs well on click-through rate, but you later want to analyze conversion rate, the allocation bias confounds the results.

This forces teams to choose reward functions more carefully, often requiring more complex metrics that capture multiple business objectives simultaneously. For DoorDash, this might mean combining order value, conversion rate, and user retention into a single reward signal—each with its own weight and measurement challenges.
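A minimal sketch of such a composite reward is shown below; the component metrics, weights, and normalization cap are hypothetical, chosen only to illustrate how several signals can be folded into one scalar.

```python
def composite_reward(order_value: float, converted: bool, retained_30d: bool,
                     max_order_value: float = 200.0,
                     w_value: float = 0.5, w_conv: float = 0.3,
                     w_ret: float = 0.2) -> float:
    """Combine several business signals into a single reward in [0, 1].

    The weights and the order-value cap are illustrative tuning knobs,
    not values from DoorDash's system.
    """
    value_term = min(order_value, max_order_value) / max_order_value
    return w_value * value_term + w_conv * float(converted) + w_ret * float(retained_30d)
```

Because the bandit optimizes exactly this scalar and nothing else, the weights effectively encode business priorities, which is why reward design deserves as much scrutiny as the algorithm itself.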

User Experience Consistency

MAB's aggressive allocation adjustments can create inconsistent experiences for returning users. If a user interacts with a feature multiple times across sessions, they might see different variants each time as the algorithm reallocates traffic. This can be jarring for users and may affect their perception of product quality.

Traditional A/B testing avoids this by maintaining consistent assignment throughout an experiment. Each user sees only one variant for the duration, ensuring a coherent experience.

Algorithmic Complexity

While Thompson sampling is conceptually elegant, implementing it correctly requires careful attention to numerical stability, handling of edge cases, and validation of assumptions. The Bayesian approach also requires specifying prior distributions, which can influence results, especially early in experiments when data is sparse.

Future Directions

DoorDash plans to address these limitations through several advanced techniques:

Contextual Bandits: Traditional MAB treats each decision independently, ignoring user context. Contextual bandits incorporate features about the user, device, and situation into the decision process. For example, a new user might see more exploration, while a power user might see more exploitation. This improves both learning efficiency and user experience.
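The sketch below illustrates one common formulation, linear Thompson sampling, in which each variant keeps a Bayesian linear model over context features and the sampled model's prediction drives selection. The feature vector, prior, and noise assumptions are illustrative, not a description of DoorDash's system.

```python
import numpy as np

class LinearThompsonArm:
    """One variant with a Bayesian linear model over context features.

    Minimal sketch: ridge-style Gaussian posterior with unit noise variance.
    """
    def __init__(self, n_features: int, rng, prior_precision: float = 1.0):
        self.A = prior_precision * np.eye(n_features)  # posterior precision matrix
        self.b = np.zeros(n_features)                  # precision-weighted mean
        self.rng = rng

    def sample_reward(self, context: np.ndarray) -> float:
        cov = np.linalg.inv(self.A)
        mean = cov @ self.b
        theta = self.rng.multivariate_normal(mean, cov)  # sample coefficients
        return float(theta @ context)

    def update(self, context: np.ndarray, reward: float) -> None:
        self.A += np.outer(context, context)
        self.b += reward * context

def choose(arms: dict, context: np.ndarray) -> str:
    """Serve the variant whose sampled model predicts the highest reward."""
    return max(arms, key=lambda name: arms[name].sample_reward(context))

# Example usage with hypothetical features (bias term, is_new_user, is_mobile):
# rng = np.random.default_rng(0)
# arms = {name: LinearThompsonArm(3, rng) for name in ("control", "variant_a")}
# context = np.array([1.0, 1.0, 0.0])
# served = choose(arms, context)
# ... later: arms[served].update(context, observed_reward)
```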

Bayesian Optimization: For experiments with continuous parameters (like pricing or UI element sizes), Bayesian optimization can search the parameter space more efficiently than discrete A/B testing. This is particularly valuable for fine-tuning algorithms where the optimal configuration lies in a continuous space.
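For intuition, here is a self-contained sketch of Bayesian optimization over a single continuous knob: a small Gaussian-process surrogate plus an expected-improvement acquisition function, evaluated against a synthetic objective. The kernel, noise level, and objective are all hypothetical stand-ins for a real business metric.

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-3):
    """Gaussian-process posterior mean and std at the query points."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_query)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_obs
    var = 1.0 - np.sum((K_s.T @ K_inv) * K_s.T, axis=1)
    return mean, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mean, std, best_y):
    """Acquisition: expected gain over the best reward observed so far."""
    z = (mean - best_y) / std
    return (mean - best_y) * norm.cdf(z) + std * norm.pdf(z)

# Hypothetical objective: reward as a function of one continuous knob in [0, 1]
# (say, a ranking weight); the optimizer never sees this function directly.
def observe(x, rng):
    return np.exp(-((x - 0.62) ** 2) / 0.02) + rng.normal(0, 0.05)

rng = np.random.default_rng(0)
x_obs = rng.uniform(0, 1, size=3)
y_obs = np.array([observe(x, rng) for x in x_obs])
grid = np.linspace(0, 1, 200)

for _ in range(15):
    mean, std = gp_posterior(x_obs, y_obs, grid)
    x_next = grid[np.argmax(expected_improvement(mean, std, y_obs.max()))]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, observe(x_next, rng))

print("best setting found:", round(float(x_obs[np.argmax(y_obs)]), 3))
```

The surrogate lets each new observation inform where to evaluate next, which is what makes the search more sample-efficient than sweeping the parameter with discrete A/B cells.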

Sticky Assignment: To address user experience consistency, DoorDash can implement sticky assignment that ensures returning users see the same variant for a defined period (e.g., 24 hours) even as allocations shift. This requires careful coordination between the decision engine and user identification systems.
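One simple way to approximate sticky assignment is to cache each user's served variant with a time-to-live and consult the bandit only when the entry has expired. The sketch below uses an in-memory dictionary and a 24-hour window to mirror the example above; a production system would presumably use a shared, low-latency store, which is an assumption here.

```python
import time

STICKINESS_SECONDS = 24 * 60 * 60  # 24-hour stickiness window (illustrative)
_assignments = {}  # (user_id, experiment) -> (variant, assigned_at)

def get_variant(user_id: str, experiment: str, choose_variant) -> str:
    """Return the cached variant if still fresh; otherwise ask the bandit."""
    key = (user_id, experiment)
    now = time.time()
    cached = _assignments.get(key)
    if cached and now - cached[1] < STICKINESS_SECONDS:
        return cached[0]
    variant = choose_variant()  # e.g., the Thompson-sampling selector sketched earlier
    _assignments[key] = (variant, now)
    return variant
```

The trade-off is that cached assignments lag the latest allocation, so stickier windows buy consistency at the cost of slower adaptation.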

Broader Implications

DoorDash's adoption of MAB reflects a broader trend in data-driven organizations: moving from batch experimentation to continuous optimization. This shift is enabled by several factors:

  1. Streaming Infrastructure: Modern data pipelines can process user feedback in real time, making dynamic allocation feasible.

  2. Computational Resources: Cloud computing makes it economical to run complex algorithms at scale.

  3. Organizational Maturity: As teams become more sophisticated in experimentation, they can handle the increased complexity of adaptive methods.

However, this transition isn't universally applicable. MAB works best for:

  • High-volume user interactions where decisions can be made frequently
  • Clear, measurable reward functions
  • Environments where user experience consistency is less critical
  • Teams with strong statistical and algorithmic expertise

For low-volume experiments or situations requiring precise measurement of multiple metrics, traditional A/B testing remains valuable. The key is matching the methodology to the problem.

Conclusion

DoorDash's implementation of multi-armed bandits demonstrates how machine learning can optimize business processes beyond traditional prediction tasks. By treating experimentation as a sequential decision problem, they've reduced opportunity costs while accelerating learning velocity.

The approach isn't without trade-offs. Teams must accept less flexibility in post-experiment analysis and invest more effort in designing appropriate reward functions. User experience consistency requires additional engineering. But for organizations running hundreds of concurrent experiments, the benefits of adaptive allocation can outweigh these costs.

As more companies adopt MAB, we'll likely see standardization of algorithms, tooling, and best practices—much as A/B testing has evolved from custom implementations to standardized platforms. The future of experimentation may not be choosing between static and adaptive methods, but rather having both available and knowing when each is appropriate.

For teams considering this transition, DoorDash's experience suggests starting with high-volume, clear-reward use cases while building the infrastructure and expertise needed for more complex scenarios. The investment in streaming data pipelines, low-latency decision services, and statistical monitoring pays dividends across the experimentation portfolio.

This article is based on the InfoQ presentation "Enhancing A/B Testing at DoorDash with Multi-Armed Bandits" by Caixia Huang and Alex Weinstein. For more details on the implementation and results, see the original InfoQ article.
