Duolingo's Kubernetes Leap: From ECS to EKS at Scale

Duolingo's migration from ECS to EKS for 500+ backend services, featuring IPv6 adoption, GitOps with Argo CD, and cellular architecture for safe deployments.

Franka Passing, Senior Platform Engineer at Duolingo, shared the company's journey migrating over 500 backend services from AWS ECS to EKS. The migration was driven by the need for advanced deployment strategies, richer ecosystem tooling, and IPv6 readiness.

Why Migrate from ECS to Kubernetes?

Duolingo had been running successfully on ECS for years, but as the company grew to 128 million monthly active users and 400+ engineers, it needed more sophisticated capabilities:

Blue-green deployments with automated rollback based on latency and error metrics
Ephemeral development environments for testing pull requests
Advanced scaling with tools like Karpenter for Spot instance optimization
Multi-cloud flexibility for future infrastructure options

Building the Foundation

The migration began with a small team of 6-7 engineers working alongside platform specialists in observability, security, and CI/CD. Key architectural decisions included:

GitOps with Argo CD

Duolingo adopted Argo CD for its declarative GitOps approach. The resulting deployment setup includes:

Blue-green rollouts with automated health checks on latency and 5xx errors
Custom deployment strategies including phased canary deployments
Cellular architecture with isolated tenants (dev/stage/prod environments) for safe testing
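
The automated health check behind a blue-green rollout can be pictured as a gate on the two signals named above, latency and 5xx errors. The sketch below is illustrative only: the `WindowStats` type, the `should_promote` function, and all thresholds are assumptions, not Duolingo's or Argo CD's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Metrics observed for the new (green) version over one analysis window."""
    requests: int
    errors_5xx: int
    p99_latency_ms: float


def should_promote(stats: WindowStats,
                   max_error_rate: float = 0.01,       # assumed threshold
                   max_p99_latency_ms: float = 500.0,  # assumed threshold
                   min_requests: int = 100) -> bool:
    """Return True to promote green, False to roll back to blue."""
    if stats.requests < min_requests:
        # Too little traffic to judge; fail the check rather than guess.
        return False
    error_rate = stats.errors_5xx / stats.requests
    return (error_rate <= max_error_rate
            and stats.p99_latency_ms <= max_p99_latency_ms)
```

Failing the check on low traffic is a deliberate choice in this sketch: a rollout that cannot be evaluated is treated the same as one that looks unhealthy.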

IPv6-Only Pod Networking

The team made a bold decision to run IPv6-only pods in dual-stack VPCs, with the following consequences:

Future-proofed infrastructure with no IPv4 address exhaustion concerns
Required application code updates to accept IPv6 connections
Some AWS services had limited IPv6 support (DynamoDB added it only recently)
Unexpected NAT costs due to IPv4-only external dependencies
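
The application change is often as small as the bind address: a server listening only on `0.0.0.0` typically cannot accept connections addressed to a pod's IPv6 address. A minimal Python sketch of the idea, where `bind_host_for` and `open_listener` are hypothetical helper names and not taken from the talk:

```python
import ipaddress
import socket


def bind_host_for(pod_ip: str) -> str:
    """Pick the wildcard address matching the pod IP's address family."""
    return "::" if ipaddress.ip_address(pod_ip).version == 6 else "0.0.0.0"


def open_listener(pod_ip: str, port: int) -> socket.socket:
    """Open a TCP listener on the wildcard address for the pod's family."""
    host = bind_host_for(pod_ip)
    family = socket.AF_INET6 if host == "::" else socket.AF_INET
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock
```

Many web frameworks expose the same choice as a host setting, so in practice the fix may be a configuration change rather than new socket code.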

Observability Integration

Comprehensive monitoring setup across multiple tools:

Honeycomb for distributed tracing with Kubernetes cluster tags
Sentry for error tracking
PagerDuty for alerting
CloudWatch for AWS metrics

Challenges included distinguishing ECS vs EKS alerts and maintaining familiar interfaces for developers.
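
One way to tell ECS and EKS telemetry apart is to stamp every span with platform attributes. The sketch below builds a value for the standard OpenTelemetry `OTEL_RESOURCE_ATTRIBUTES` environment variable. `service.name` and `k8s.cluster.name` are OpenTelemetry semantic-convention keys; the `deploy.platform` key, the function name, and the example values are assumptions for illustration.

```python
from typing import Optional


def resource_attributes(service: str, platform: str,
                        cluster: Optional[str] = None) -> str:
    """Build a comma-separated key=value string for OTEL_RESOURCE_ATTRIBUTES."""
    attrs = {
        "service.name": service,
        "deploy.platform": platform,  # hypothetical custom key: "ecs" or "eks"
    }
    if cluster is not None:
        attrs["k8s.cluster.name"] = cluster
    return ",".join(f"{key}={value}" for key, value in attrs.items())


# Example with made-up names:
# resource_attributes("example-service", "eks", "example-cluster")
```

With such tags in place, a trace query or alert rule can filter on the platform attribute instead of relying on service naming.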

Service Templates

Two Helm chart templates were created:

Web services with HTTP ingress
Worker services with KEDA-based queue scaling

Terraform still managed AWS permissions and environment variables via EKS Pod Identity.
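
Queue-based scaling of the kind the worker template relies on reduces, roughly, to a ratio: the queue length is divided by a target number of messages per replica, then clamped to the allowed replica range. A sketch of that arithmetic, with hypothetical parameter names and default values:

```python
import math


def desired_replicas(queue_length: int,
                     messages_per_replica: int = 5,  # assumed target
                     min_replicas: int = 0,
                     max_replicas: int = 50) -> int:
    """Replicas needed so each handles about `messages_per_replica` messages."""
    if messages_per_replica <= 0:
        raise ValueError("messages_per_replica must be positive")
    wanted = math.ceil(queue_length / messages_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

A minimum of zero lets idle workers scale away entirely, which is one reason queue consumers are a natural fit for this style of scaling.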

Migrating Services: The Owl-Service Example

The migration process followed a structured approach:

Terraform setup for AWS permissions and observability
Argo CD manifests defining service configuration, scaling, and deployment strategy
Service validation comparing metrics, traces, and responses against ECS baseline
Canary testing with weighted DNS routing, shifting traffic gradually from 10% to 100%

The owl-service, a Python backend, was the second production service migrated. The team used DNS weighting for traffic control, allowing quick rollbacks when issues arose.
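
Weighted DNS routing amounts to splitting a fixed weight budget between the ECS and EKS records and stepping the EKS share upward while the service stays healthy. The sketch below is an illustration under assumptions: the intermediate step percentages, the function names, and the record labels are hypothetical.

```python
from typing import Dict, List

# Percent of traffic sent to EKS at each step; intermediate values are assumed.
CANARY_STEPS: List[int] = [10, 25, 50, 100]


def record_weights(eks_percent: int) -> Dict[str, int]:
    """Split 100 weight units between the EKS and ECS DNS records."""
    if not 0 <= eks_percent <= 100:
        raise ValueError("eks_percent must be between 0 and 100")
    return {"eks": eks_percent, "ecs": 100 - eks_percent}


def next_step(current_percent: int, healthy: bool) -> int:
    """Advance to the next canary step, or shift all traffic back to ECS."""
    if not healthy:
        return 0  # rollback: ECS serves everything again
    for step in CANARY_STEPS:
        if step > current_percent:
            return step
    return 100
```

Because the ECS deployment keeps running throughout, a rollback is only a weight change, which is what makes it quick.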

Challenges and Lessons

Recency Bias

Teams tended to blame any new issue on Kubernetes, the most recent change, rather than on its actual root cause. Solution: Enhanced observability to prove root causes and hands-on incident support to build trust.

Rate Limiting

Unexpected AWS service limits emerged during rollout:

EKS Pod Identity API throttling
AMP (Managed Prometheus) rate limits

Mitigation: Slow rollouts, early migration of large-scale services, and close AWS TAM collaboration.
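
A slow rollout can be as simple as migrating services in small waves with a pause between them, so that bursts of calls to throttled APIs stay under the limit. A minimal sketch, where the `plan_waves` helper and the wave size are hypothetical:

```python
from typing import List, Sequence


def plan_waves(services: Sequence[str], wave_size: int) -> List[List[str]]:
    """Split services into consecutive waves of at most `wave_size` each."""
    if wave_size <= 0:
        raise ValueError("wave_size must be positive")
    return [list(services[i:i + wave_size])
            for i in range(0, len(services), wave_size)]
```

Ordering the input so that the largest services come first would match the early-migration mitigation, since limits then surface while few services depend on the new platform.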

Human Factors

VIP support with dedicated EKS team partners for each service
Training sessions and documentation updates
Flexible timelines allowing teams to migrate at their own pace

Current State and Future

As of the presentation, Duolingo had:

All new services automatically deployed to EKS
10 most critical services migrated
ECS services kept running at 1% traffic for rollback capability
Automated migration workflows reducing manual effort

The migration remains ongoing, with plans for general adoption after early adopter validation.

Key Takeaways

Strong user demand justifies migration costs
Observability-first approach builds confidence in new platforms
Gradual rollout with canary testing prevents large-scale failures
Cellular architecture enables safe platform changes
IPv6 adoption future-proofs infrastructure despite initial friction

The presentation concluded with a Q&A addressing technical reasons for migration (deployment strategies, Karpenter, Argo CD features) and observability approaches using OpenTelemetry collectors feeding into Honeycomb.
