Duolingo's migration from ECS to EKS for 500+ backend services, featuring IPv6 adoption, GitOps with Argo CD, and cellular architecture for safe deployments.
Franka Passing, Senior Platform Engineer at Duolingo, shared the company's journey migrating over 500 backend services from AWS ECS to EKS. The migration was driven by the need for advanced deployment strategies, richer ecosystem tooling, and IPv6 readiness.
Why Migrate from ECS to Kubernetes?
Duolingo had been running successfully on ECS for years, but as the company grew to 128 million monthly active users and 400+ engineers, it needed more sophisticated capabilities:
- Blue-green deployments with automated rollback based on latency and error metrics
- Ephemeral development environments for testing pull requests
- Advanced scaling with tools like Karpenter for Spot instance optimization
- Multi-cloud flexibility for future infrastructure options
Building the Foundation
The migration began with a small team of 6-7 engineers working alongside platform specialists in observability, security, and CI/CD. Key architectural decisions included:
GitOps with Argo CD
Duolingo adopted Argo CD for its declarative, GitOps approach:
- Blue-green rollouts with automated health checks on latency and 5xx errors
- Custom deployment strategies including phased canary deployments
- Cellular architecture with isolated tenants (dev/stage/prod environments) for safe testing
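The automated health check behind a blue-green rollout can be pictured as a simple gate over latency and 5xx metrics. The following is a minimal sketch of that idea in Python, not Argo's configuration format or Duolingo's actual implementation; the `HealthSample` type, the `should_rollback` function, and both threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSample:
    """Metrics observed for the new (green) version over one analysis window."""
    p99_latency_ms: float
    requests: int
    errors_5xx: int


def should_rollback(sample: HealthSample,
                    max_p99_latency_ms: float = 500.0,  # hypothetical threshold
                    max_5xx_ratio: float = 0.01) -> bool:  # hypothetical threshold
    """Return True when the new version breaches either health threshold."""
    if sample.requests == 0:
        # No traffic observed yet, so there is nothing to judge.
        return False
    error_ratio = sample.errors_5xx / sample.requests
    too_slow = sample.p99_latency_ms > max_p99_latency_ms
    too_many_errors = error_ratio > max_5xx_ratio
    return too_slow or too_many_errors
```

A rollout controller would evaluate a gate like this repeatedly during the rollout and switch traffic back to the previous version as soon as it returns True.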
IPv6-Only Pod Networking
The team made a bold decision to run IPv6-only pods in dual-stack VPCs:
- Future-proofed infrastructure with no IPv4 address exhaustion concerns
- Required application code updates to accept IPv6 connections
- Limited IPv6 support in some AWS services (DynamoDB added IPv6 support only recently)
- Unexpected NAT costs due to IPv4-only external dependencies
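The application change mentioned above usually comes down to which address a server binds to: a service hard-coded to listen on `0.0.0.0` only accepts IPv4 connections and is unreachable in an IPv6-only pod. A minimal sketch in Python, assuming a plain TCP server; the function names and the default port are hypothetical and not taken from the talk.

```python
import ipaddress
import socket


def address_family(host: str) -> socket.AddressFamily:
    """Map a literal bind address to the socket family it requires."""
    version = ipaddress.ip_address(host).version
    return socket.AF_INET6 if version == 6 else socket.AF_INET


def open_listener(host: str = "::", port: int = 8080) -> socket.socket:
    """Open a TCP listener; the default "::" is the IPv6 wildcard address."""
    server = socket.socket(address_family(host), socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Binding to "0.0.0.0" here would create an IPv4-only listener.
    server.bind((host, port))
    server.listen()
    return server
```

Many web frameworks expose the same choice as a host setting, so the fix is often a one-line configuration change rather than new socket code.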
Observability Integration
Monitoring was integrated across several tools:
- Honeycomb for distributed tracing with Kubernetes cluster tags
- Sentry for error tracking
- PagerDuty for alerting
- CloudWatch for AWS metrics
Challenges included distinguishing ECS vs EKS alerts and maintaining familiar interfaces for developers.
Service Templates
Two Helm chart templates were created:
- Web services with HTTP ingress
- Worker services with KEDA-based queue scaling
Terraform continued to manage AWS permissions (via EKS Pod Identity) and environment variables.
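Queue-based scaling for worker services follows a simple proportional rule: run roughly one worker per target number of queued messages, clamped to configured bounds. The sketch below illustrates that arithmetic in Python. It is not KEDA's code or configuration, and the function name, parameter names, and default bounds are hypothetical.

```python
import math


def desired_workers(queue_length: int,
                    target_per_worker: int,
                    min_workers: int = 0,    # hypothetical lower bound
                    max_workers: int = 20) -> int:  # hypothetical upper bound
    """Return the worker count needed to keep each worker near its target backlog."""
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    # Round up so a partial batch of messages still gets a worker.
    wanted = math.ceil(queue_length / target_per_worker)
    # Clamp the result to the configured bounds.
    return max(min_workers, min(max_workers, wanted))
```

With a target of 10 messages per worker, a backlog of 95 messages asks for 10 workers, and an empty queue scales down to the minimum.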
Migrating Services: The Owl-Service Example
The migration process followed a structured approach:
- Terraform setup for AWS permissions and observability
- Argo CD manifests defining service configuration, scaling, and deployment strategy
- Service validation comparing metrics, traces, and responses against ECS baseline
- Canary testing with weighted DNS routing, gradually shifting traffic from 10% to 100%
The owl-service, a Python backend, was the second production service migrated. The team used DNS weighting for traffic control, allowing quick rollbacks when issues arose.
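The weighted DNS approach can be sketched as a pair of weights that always sum to 100, plus a rule for moving between steps. This Python sketch is an illustration under stated assumptions rather than Duolingo's tooling: the talk gives only the 10% to 100% range, so the intermediate steps and both function names are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical ramp; only the 10% start and 100% end come from the talk.
RAMP_STEPS: Tuple[int, ...] = (10, 25, 50, 100)


def dns_weights(eks_percent: int) -> Dict[str, int]:
    """Split traffic between the ECS and EKS records for a given EKS share."""
    if not 0 <= eks_percent <= 100:
        raise ValueError("eks_percent must be between 0 and 100")
    return {"ecs": 100 - eks_percent, "eks": eks_percent}


def next_step(current_percent: int, healthy: bool,
              steps: Tuple[int, ...] = RAMP_STEPS) -> int:
    """Advance to the next ramp step when healthy, otherwise roll back to 0%."""
    if not healthy:
        # Rollback: send all traffic back to ECS.
        return 0
    for step in steps:
        if step > current_percent:
            return step
    # Already at the final step, so stay there.
    return steps[-1]
```

Because the rollback path is just another weight change, reverting a bad canary is as cheap as advancing a good one, which is what makes quick rollbacks practical.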
Challenges and Lessons
Recency Bias
Teams tended to blame new issues on Kubernetes, the most recent change, rather than on the actual root cause. Solution: enhanced observability to prove root causes and hands-on incident support to build trust.
Rate Limiting
Unexpected AWS service limits emerged during rollout:
- EKS Pod Identity API throttling
- AMP (Managed Prometheus) rate limits
Mitigation: Slow rollouts, early migration of large-scale services, and close AWS TAM collaboration.
Human Factors
- VIP support with dedicated EKS team partners for each service
- Training sessions and documentation updates
- Flexible timelines allowing teams to migrate at their own pace
Current State and Future
As of the presentation, Duolingo had:
- All new services automatically deployed to EKS
- 10 most critical services migrated
- ECS services kept running at 1% traffic for rollback capability
- Automated migration workflows reducing manual effort
The migration remains ongoing, with plans for general adoption after early adopter validation.
Key Takeaways
- Strong user demand justifies migration costs
- Observability-first approach builds confidence in new platforms
- Gradual rollout with canary testing prevents large-scale failures
- Cellular architecture enables safe platform changes
- IPv6 adoption future-proofs infrastructure despite initial friction
The presentation concluded with a Q&A addressing technical reasons for migration (deployment strategies, Karpenter, Argo CD features) and observability approaches using OpenTelemetry collectors feeding into Honeycomb.
