LinkedIn Replaces ZooKeeper with Kafka and xDS for Scalable Service Discovery
#Infrastructure

LinkedIn Replaces ZooKeeper with Kafka and xDS for Scalable Service Discovery

Infrastructure Reporter
2 min read

LinkedIn engineers migrated their service discovery system from Apache ZooKeeper to a Kafka-backed architecture with xDS protocol integration, achieving sub-second latency at scale while enabling polyglot client support.

Featured image

LinkedIn has completed a multi-year migration of its service discovery infrastructure, replacing Apache ZooKeeper with an eventual-consistency architecture built on Apache Kafka and the xDS protocol. This architectural shift addresses fundamental scalability limitations while supporting LinkedIn's growth to hundreds of thousands of microservice instances.

The Scaling Challenge

The legacy ZooKeeper implementation faced critical limitations:

  • Write amplification: Direct writes from application servers caused massive spikes during deployments
  • Read storms: ZooKeeper watches triggered cascading read requests under load
  • Consistency bottlenecks: Strict ordering requirements caused write backlogs during read saturation
  • Java-centric limitations: Non-JVM clients required complex bridging solutions

With projections showing the system would exceed capacity by 2025, engineers designed a new architecture separating write and read paths.

Kafka-xDS Architecture

The replacement system implements a clear separation of concerns:

Write Path

  • Application servers publish service registration events to dedicated Apache Kafka topics
  • Kafka provides durable, ordered write sequencing with horizontal scalability

Read Path

  • Stateful Observer services consume Kafka streams to maintain in-memory service registries
  • Observers expose state via xDS (gRPC) protocol compatible with Envoy and gRPC clients
  • Clients maintain persistent gRPC streams receiving incremental updates

Author photo Patrick Farry, Software Architect

Technical Advantages

  1. Scalability: Observer layer scales horizontally per data center fabric
  2. Polyglot support: Native xDS integration enables first-class clients in Python, Go, Rust, etc.
  3. Performance: Benchmarks show single Observer instances handle:
    • 40,000 concurrent client streams
    • 10,000 updates per second
  4. Reduced latency: Data propagation dropped from P99 30s to under 5s

Zero-Downtime Migration

The migration strategy employed several key techniques:

  • Dual-write mode: Applications simultaneously wrote to ZooKeeper and Kafka during transition
  • Shadow reads: Observers validated against ZooKeeper state before traffic cutover
  • Automated legacy detection: Cron jobs identified remaining ZooKeeper dependencies
  • Cross-fabric failover: Clients could connect to remote Observers during regional outages

Future Integration Paths

The xDS foundation enables seamless integration with:

  • Service mesh control planes (e.g., Envoy-based implementations)
  • Global load balancing systems
  • Centralized traffic management policies

LinkedIn's migration demonstrates how replacing strongly consistent coordination systems with log-based architectures can solve scaling limitations for service discovery. The Kafka-xDS approach provides a blueprint for organizations facing similar scaling challenges with legacy systems.

For implementation details, see LinkedIn's engineering blog post (official publication).

Comments

Loading comments...