LinkedIn engineers migrated their service discovery system from Apache ZooKeeper to a Kafka-backed architecture with xDS protocol integration, achieving sub-second latency at scale while enabling polyglot client support.

LinkedIn has completed a multi-year migration of its service discovery infrastructure, replacing Apache ZooKeeper with an eventual-consistency architecture built on Apache Kafka and the xDS protocol. This architectural shift addresses fundamental scalability limitations while supporting LinkedIn's growth to hundreds of thousands of microservice instances.
The Scaling Challenge
The legacy ZooKeeper implementation faced critical limitations:
- Write amplification: Direct writes from application servers caused massive spikes during deployments
- Read storms: ZooKeeper watches triggered cascading read requests under load
- Consistency bottlenecks: Strict ordering requirements caused write backlogs during read saturation
- Java-centric limitations: Non-JVM clients required complex bridging solutions
With projections showing the system would exceed capacity by 2025, engineers designed a new architecture separating write and read paths.
Kafka-xDS Architecture
The replacement system implements a clear separation of concerns:
Write Path
- Application servers publish service registration events to dedicated Apache Kafka topics
- Kafka provides durable, ordered write sequencing with horizontal scalability
Read Path
- Stateful Observer services consume Kafka streams to maintain in-memory service registries
- Observers expose state via xDS (gRPC) protocol compatible with Envoy and gRPC clients
- Clients maintain persistent gRPC streams receiving incremental updates
Patrick Farry, Software Architect
Technical Advantages
- Scalability: Observer layer scales horizontally per data center fabric
- Polyglot support: Native xDS integration enables first-class clients in Python, Go, Rust, etc.
- Performance: Benchmarks show single Observer instances handle:
- 40,000 concurrent client streams
- 10,000 updates per second
- Reduced latency: Data propagation dropped from P99 30s to under 5s
Zero-Downtime Migration
The migration strategy employed several key techniques:
- Dual-write mode: Applications simultaneously wrote to ZooKeeper and Kafka during transition
- Shadow reads: Observers validated against ZooKeeper state before traffic cutover
- Automated legacy detection: Cron jobs identified remaining ZooKeeper dependencies
- Cross-fabric failover: Clients could connect to remote Observers during regional outages
Future Integration Paths
The xDS foundation enables seamless integration with:
- Service mesh control planes (e.g., Envoy-based implementations)
- Global load balancing systems
- Centralized traffic management policies
LinkedIn's migration demonstrates how replacing strongly consistent coordination systems with log-based architectures can solve scaling limitations for service discovery. The Kafka-xDS approach provides a blueprint for organizations facing similar scaling challenges with legacy systems.
For implementation details, see LinkedIn's engineering blog post (official publication).

Comments
Please log in or register to join the discussion