Airbnb migrated from StatsD to OpenTelemetry, processing 100M samples/second with 10x cost reduction through vmagent aggregation.
Airbnb's Metrics Migration: 100M Samples/Second with OpenTelemetry
Airbnb's observability engineering team has completed a large-scale migration from StatsD and a proprietary Veneur-based aggregation pipeline to a modern, open-source metrics stack built on OpenTelemetry Protocol (OTLP), the OpenTelemetry Collector, and VictoriaMetrics' vmagent. The resulting system now ingests over 100 million samples per second in production, achieving roughly an order of magnitude cost reduction compared to the previous vendor-based architecture.
The Migration Strategy
Rather than performing a phased feature migration, the team chose to front-load collection: get all metrics flowing into the new system first, then address dashboards, alerts, and user-facing tooling with real data already available. This approach minimized disruption while ensuring the new system could handle production workloads.
The primary challenge was bridging three coexisting instrumentation worlds: StatsD libraries, growing OTLP adoption, and a new Prometheus-compatible storage backend based on Grafana Mimir. Roughly 40% of Airbnb services use a shared, platform-maintained metrics library. The observability team updated this library to dual-emit metrics: sending StatsD to the legacy pipeline while simultaneously emitting OTLP to the new OpenTelemetry Collector, enabling broad migration progress with minimal friction across the service fleet.
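The dual-emit pattern can be illustrated with a minimal sketch. The class and backend names below are hypothetical stand-ins, not Airbnb's actual library API: the point is that a single counter increment fans out to both the legacy StatsD path and the new OTLP path.

```python
# Minimal sketch of the dual-emit pattern; names are hypothetical,
# not Airbnb's platform metrics library.

class StatsdBackend:
    """Stands in for the legacy StatsD pipeline (per-flush deltas over UDP)."""
    def __init__(self):
        self.received = []
    def increment(self, name, value, tags):
        self.received.append((name, value, tags))

class OtlpBackend:
    """Stands in for an OTLP exporter pointed at the OpenTelemetry Collector."""
    def __init__(self):
        self.received = []
    def add(self, name, value, attributes):
        self.received.append((name, value, attributes))

class DualEmitCounter:
    """Fans every increment out to both pipelines during the migration."""
    def __init__(self, name, statsd, otlp):
        self.name, self.statsd, self.otlp = name, statsd, otlp
    def increment(self, value=1, tags=None):
        tags = tags or {}
        self.statsd.increment(self.name, value, tags)  # legacy path
        self.otlp.add(self.name, value, tags)          # new path

statsd, otlp = StatsdBackend(), OtlpBackend()
requests = DualEmitCounter("http.requests", statsd, otlp)
requests.increment(tags={"route": "/listings"})
assert statsd.received == otlp.received  # both pipelines see the same data
```

Because services only upgrade a library dependency, the fleet migrates without per-service code changes, and the two pipelines can be compared sample-for-sample before cutover.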
Performance Gains and Protocol Benefits
Moving to OTLP brought measurable gains: CPU time spent on metrics processing in JVM services dropped from 10% to under 1% of total CPU samples. Beyond performance, OTLP's use of TCP eliminated packet loss risks inherent in StatsD's UDP transport, and the protocol's native support for Prometheus exponential histograms removed the need for an intermediate translation layer inside the collector.
However, the highest-cardinality services, emitting upward of 10,000 samples per second per instance, experienced memory pressure and increased garbage collection after enabling OTLP. The fix was switching those specific services to delta temporality via AggregationTemporalitySelector.deltaPreferred(), which avoids retaining state for every metric-label combination between exports. The accepted trade-off was that unexpected failures would now surface as visible data gaps rather than anomalous jumps.
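The memory difference between the two temporalities can be shown with a small simulation. This is an illustrative model, not the OpenTelemetry SDK itself: a cumulative counter must keep every label combination alive between exports, while a delta counter clears its state after each export.

```python
# Illustrative model of cumulative vs. delta temporality for a counter
# with many label combinations; not the OpenTelemetry SDK.

class Counter:
    def __init__(self, delta=False):
        self.delta = delta
        self.state = {}  # label-set -> value accumulated since last export

    def add(self, labels, value=1):
        self.state[labels] = self.state.get(labels, 0) + value

    def export(self):
        snapshot = dict(self.state)
        if self.delta:
            self.state.clear()  # delta: nothing retained between exports
        return snapshot

cumulative, delta = Counter(delta=False), Counter(delta=True)
for counter in (cumulative, delta):
    for user in range(1000):  # 1,000 distinct label combinations
        counter.add(("user", user))
    counter.export()

# Cumulative must keep every label combination resident; delta starts fresh.
assert len(cumulative.state) == 1000
assert len(delta.state) == 0
```

At tens of thousands of active series per instance, that retained per-series state is exactly what drove the observed heap growth and GC pressure.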
Aggregation Architecture Decisions
Airbnb's previous pipeline aggregated away instance-level labels (such as pod and hostname) using an internal fork of Veneur before forwarding to its metrics vendor. The new Prometheus-based stack required a comparable aggregation step to keep storage costs manageable.
The team evaluated several alternatives:
- Continuing with Veneur would have required a significant rewrite to support the Prometheus data model
- Recording rules were ruled out because they require storing raw data in the TSDB before aggregation, defeating the cost-saving goal
- The OpenTelemetry Collector lacked native metric aggregation support despite open proposals
- Vector was eliminated due to the absence of built-in scaling and limited Rust adoption at Airbnb
- m3aggregator was deemed architecturally more complex than necessary
In the end, vmagent was selected for its built-in streaming aggregation for Prometheus metrics, horizontal sharding support, approachable documentation, and a small codebase of roughly 10,000 lines that made internal customization tractable.
The production architecture uses two layers of vmagent: stateless router pods that shard metrics by consistently hashing all labels except the ones to be aggregated away, and stateful aggregator pods that maintain running totals in memory. Routers are configured with a static list of aggregator hostnames, leveraging Kubernetes StatefulSet stable network identities to avoid additional service discovery dependencies. The production cluster scaled to hundreds of aggregators.
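The router tier's sharding rule can be sketched as follows. This is a simplified illustration using SHA-256 and modular shard selection; vmagent's actual hashing implementation differs. The key property is that the hash covers every label except the ones being aggregated away, so all per-pod variants of a series always reach the same aggregator.

```python
# Sketch of the router-tier sharding rule: hash every label except those
# being aggregated away. Simplified illustration; vmagent's real hashing
# implementation differs.

import hashlib

AGGREGATED_AWAY = {"pod", "hostname"}

def pick_aggregator(metric_name, labels, aggregators):
    kept = sorted((k, v) for k, v in labels.items() if k not in AGGREGATED_AWAY)
    key = metric_name + "".join(f"{k}={v};" for k, v in kept)
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return aggregators[digest % len(aggregators)]

# StatefulSet-style stable hostnames stand in for the static aggregator list.
aggs = [f"vmagent-agg-{i}.vmagent-agg" for i in range(4)]

a = pick_aggregator("http_requests_total", {"region": "us", "pod": "web-1"}, aggs)
b = pick_aggregator("http_requests_total", {"region": "us", "pod": "web-2"}, aggs)
assert a == b  # per-pod variants of one series meet on one aggregator
```

Because the aggregated-away labels are excluded from the key, each aggregator holds the complete running total for the series it owns, and no cross-aggregator merge step is needed.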
Several generic improvements were contributed back to the VictoriaMetrics upstream project.
The Zero Injection Solution
After completing the collection migration, the team observed that PromQL queries over certain counters consistently underreported compared to the legacy vendor. The root cause was an edge case in how Prometheus handles counter resets at low emission rates.
In StatsD, each data point represents a delta relative to the flush window. In Prometheus, data points represent cumulative counts, and the rate() function derives deltas — but if a counter increments once and its pod restarts before it can increment again, that increment is lost before rate() can compute a meaningful delta.
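A small simulation makes the lost-increment case concrete. The function below is a simplified model of rate()-style delta derivation, not PromQL itself: computing an increase from cumulative samples requires at least two points per series, so an increment followed by a pod restart never produces a delta.

```python
# Simplified model of deriving increases from cumulative counter samples;
# not PromQL's actual rate()/increase() implementation.

def increase(samples):
    """Total increase over a list of cumulative samples (reset-aware)."""
    total, prev = 0, None
    for value in samples:
        if prev is not None:
            # A drop below the previous value is treated as a counter reset.
            total += value - prev if value >= prev else value
        prev = value
    return total

# StatsD semantics: every point is already a delta, so one point suffices.
statsd_deltas = [1]
assert sum(statsd_deltas) == 1

# Prometheus semantics: the pod was scraped once (cumulative value 1) and
# then restarted. With a single sample there is no delta to derive, so the
# lone increment is invisible.
prometheus_samples = [1]
assert increase(prometheus_samples) == 0
```

With two or more samples the derivation works as expected (increase([1, 3]) is 2); the undercounting appears only for series that die before a second scrape, which is precisely the low-frequency, high-cardinality shape of many business counters.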
At Airbnb, this edge case proved more impactful than anticipated. Many counters track high-dimensional, low-frequency events — requests per currency per user per region, for example — where any given label combination may increment only a few times per day. These were frequently business-critical metrics, and their systematic undercounting stalled user migration.
The team rejected pre-initializing all counters to zero (impractical at scale, since label combinations are unpredictable), replacing counters with gauges (which runs against Prometheus conventions), and pushing workaround PromQL onto every dashboard and alert author.
The chosen solution was a transparent "zero injection" technique implemented inside the vmagent aggregation tier: the first time an aggregated counter is flushed, the aggregator emits a synthetic zero rather than the actual running total. The real accumulated value is flushed on the subsequent interval. This implicitly initializes every counter to zero, matching Prometheus semantics, while the delayed first flush ensures the synthetic zero's timestamp does not collide with any prior samples. The trade-off is a single flush-interval lag on the first recorded increment.
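The behavior can be sketched with a toy aggregator. The class and method names are hypothetical, not the actual vmagent patch: on a series' first flush the aggregator emits a synthetic zero, and the real running total follows one interval later.

```python
# Sketch of the zero-injection behavior; hypothetical names, not the
# actual vmagent modification.

class ZeroInjectingAggregator:
    def __init__(self):
        self.totals = {}   # series -> running total
        self.seen = set()  # series whose synthetic zero has been flushed

    def ingest(self, series, value):
        self.totals[series] = self.totals.get(series, 0) + value

    def flush(self):
        out = {}
        for series, total in self.totals.items():
            if series not in self.seen:
                self.seen.add(series)
                out[series] = 0      # synthetic zero initializes the counter
            else:
                out[series] = total  # real value, one interval late
        return out

agg = ZeroInjectingAggregator()
agg.ingest("payments_total{currency='ISK'}", 1)
assert agg.flush() == {"payments_total{currency='ISK'}": 0}  # first flush
assert agg.flush() == {"payments_total{currency='ISK'}": 1}  # real total
```

Downstream, rate() now sees two samples (0, then 1) and correctly derives the increment, even if the emitting pod never increments the counter again.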
Industry Context and Broader Patterns
Flipkart and Shopify tackled versions of the same underlying problem from different angles. Flipkart's engineers faced 80 million simultaneous time series from roughly 2,000 API Gateway instances, a scale where StatsD had already collapsed under the weight of long-range queries, and solved the aggregation problem with hierarchical Prometheus federation: local servers strip high-cardinality instance labels via recording rules before exposing summarized series upward through /federate endpoints.
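The recording-rule-style step that federation relies on can be modeled in a few lines. This is an illustrative sketch of what an expression like `sum without (instance) (metric)` does, not Prometheus itself: drop the high-cardinality label and sum what remains.

```python
# Illustrative model of a `sum without (instance)` recording rule;
# not Prometheus itself.

from collections import defaultdict

def sum_without(samples, dropped_label):
    """samples: list of (labels_dict, value); returns aggregated series."""
    out = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items()
                           if k != dropped_label))
        out[key] += value
    return dict(out)

raw = [
    ({"api": "checkout", "instance": "gw-1"}, 40.0),
    ({"api": "checkout", "instance": "gw-2"}, 60.0),
    ({"api": "search", "instance": "gw-1"}, 10.0),
]
aggregated = sum_without(raw, "instance")
assert aggregated[(("api", "checkout"),)] == 100.0  # gw-1 + gw-2 merged
```

The difference from Airbnb's approach is where this runs: Flipkart aggregates inside each local Prometheus before federation, while Airbnb moved the equivalent step into a dedicated vmagent tier ahead of storage.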
Shopify's motivation was primarily financial: three separate vendors for metrics, logs, and traces had become ruinously expensive as the platform scaled, and the team rebuilt from scratch onto Prometheus, Loki, Tempo, and Grafana under a single in-house platform called Observe, using the same dual-write pattern Airbnb would later employ to validate parity before cutting over.
Neither post goes as deep as Airbnb's on the specific mechanics of aggregation or the correctness hazards of counter semantics. Together, they confirm that the StatsD exit, the vendor cost reckoning, and the move toward a Prometheus-anchored open-source stack are not idiosyncratic choices. They are a pattern playing out across hyperscale engineering organizations wherever the legacy push-based monitoring model has hit its ceiling.
Lessons for Large-Scale Observability
The migration demonstrates several key principles for large-scale observability transformations:
Dual-write validation: By emitting to both legacy and new systems simultaneously, Airbnb could validate correctness before cutting over, reducing risk during the transition.
Protocol modernization: Moving from UDP-based StatsD to TCP-based OTLP eliminated packet loss and enabled more sophisticated metric types like exponential histograms.
Aggregation as cost control: The vmagent-based aggregation tier was essential for keeping storage costs manageable while preserving the ability to query at different levels of granularity.
Semantic correctness matters: The zero injection solution shows how subtle differences between monitoring systems can have significant business impact, requiring careful attention to edge cases.
Open source flexibility: Building on open-source components like OpenTelemetry and VictoriaMetrics provided the flexibility to customize and optimize for Airbnb's specific requirements.
The completed pipeline processes over 100 million samples per second in a single production cluster, with cost reduced by roughly an order of magnitude compared to the previous vendor-based architecture. The centralized aggregation tier also became a general-purpose transformation layer: operators can drop problematic metrics caused by bad instrumentation changes without touching application code, or temporarily dual-emit raw metrics for debugging purposes.
This migration represents a mature approach to observability at scale, demonstrating how organizations can transition from legacy monitoring systems to modern, open-source alternatives while maintaining reliability and reducing costs.
