Duolingo's Kubernetes Leap: From ECS to EKS at Scale
#DevOps

Rust Reporter

Duolingo's migration from ECS to EKS for 500+ backend services, featuring IPv6 adoption, GitOps with Argo CD, and cellular architecture for safe deployments.

Franka Passing, Senior Platform Engineer at Duolingo, shared the company's journey migrating over 500 backend services from AWS ECS to EKS. The migration was driven by the need for advanced deployment strategies, richer ecosystem tooling, and IPv6 readiness.

Why Migrate from ECS to Kubernetes?

Duolingo had been running successfully on ECS for years, but as the company grew to 128 million monthly active users and 400+ engineers, they needed more sophisticated capabilities:

  • Blue-green deployments with automated rollback based on latency and error metrics
  • Ephemeral development environments for testing pull requests
  • Advanced scaling with tools like Karpenter for Spot instance optimization
  • Multi-cloud flexibility for future infrastructure options

Building the Foundation

The migration began with a small team of 6-7 engineers working alongside platform specialists in observability, security, and CI/CD. Key architectural decisions included:

GitOps with Argo CD

Duolingo adopted Argo CD for its declarative GitOps approach, which gave the team:

  • Blue-green rollouts with automated health checks on latency and 5xx errors
  • Custom deployment strategies including phased canary deployments
  • Cellular architecture with isolated tenants (dev/stage/prod environments) for safe testing
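
The automated health check behind a blue-green rollout can be pictured as a simple gate on latency and 5xx error rate. The following is a minimal sketch, not Duolingo's actual configuration: the `HealthSample` type, the `should_rollback` function, and both threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSample:
    """Metrics observed for the new (green) version during a rollout window."""
    p99_latency_ms: float
    requests: int
    errors_5xx: int


def should_rollback(sample: HealthSample,
                    max_p99_latency_ms: float = 500.0,  # hypothetical threshold
                    max_error_rate: float = 0.01) -> bool:  # hypothetical threshold
    """Return True if the green version breaches either health threshold."""
    if sample.requests == 0:
        # No traffic observed yet, so there is nothing to judge.
        return False
    error_rate = sample.errors_5xx / sample.requests
    return (sample.p99_latency_ms > max_p99_latency_ms
            or error_rate > max_error_rate)
```

A rollout controller would evaluate a gate like this repeatedly during the rollout window and switch traffic back to the old version as soon as it returns True.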

IPv6-Only Pod Networking

The team made a bold decision to run IPv6-only pods in dual-stack VPCs:

  • Future-proofed infrastructure with no IPv4 address exhaustion concerns
  • Required application code updates to accept IPv6 connections
  • Some AWS services had limited IPv6 support (DynamoDB added it only recently)
  • Unexpected NAT costs due to IPv4-only external dependencies
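
A common reason application code needs updating for IPv6-only pods is a listener hard-coded to the IPv4 wildcard address `0.0.0.0`. The sketch below, using only Python's standard library, shows one way to derive the socket family from the bind address so that binding to `::` yields an IPv6 listener. The function names `listener_family` and `open_listener` are hypothetical, and the talk did not describe the specific code changes Duolingo made.

```python
import ipaddress
import socket


def listener_family(bind_host: str) -> socket.AddressFamily:
    """Return the socket family implied by a literal IP bind address."""
    ip = ipaddress.ip_address(bind_host)
    return socket.AF_INET6 if ip.version == 6 else socket.AF_INET


def open_listener(bind_host: str = "::", port: int = 0) -> socket.socket:
    """Open a TCP listener; the default "::" is the IPv6 wildcard address."""
    sock = socket.socket(listener_family(bind_host), socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((bind_host, port))  # port 0 lets the OS pick a free port
    sock.listen()
    return sock
```

A service that previously bound to `0.0.0.0` would be unreachable on an IPv6-only pod; switching the default to `::` is the smallest version of the change described above.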

Observability Integration

Monitoring was integrated across several tools:

  • Honeycomb for distributed tracing with Kubernetes cluster tags
  • Sentry for error tracking
  • PagerDuty for alerting
  • CloudWatch for AWS metrics

Challenges included distinguishing ECS vs EKS alerts and maintaining familiar interfaces for developers.
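
One way to tell ECS and EKS signals apart is to tag every telemetry event with the platform it came from and surface that tag in alert titles. The sketch below uses plain dictionaries rather than any vendor SDK. The field names (`platform`, `service`, `message`) and both helper functions are assumptions made for illustration; `k8s.cluster.name` follows the OpenTelemetry semantic convention, but the talk did not specify which attribute names Duolingo uses.

```python
from typing import Optional


def tag_event(event: dict, platform: str, cluster: Optional[str] = None) -> dict:
    """Return a copy of the event tagged with its originating platform."""
    if platform not in ("ecs", "eks"):
        raise ValueError(f"unknown platform: {platform!r}")
    tagged = dict(event)  # copy so the caller's event is left untouched
    tagged["platform"] = platform
    if cluster is not None:
        tagged["k8s.cluster.name"] = cluster
    return tagged


def alert_title(event: dict) -> str:
    """Prefix the alert title with the platform so responders see it at a glance."""
    platform = event.get("platform", "unknown").upper()
    return f"[{platform}] {event['service']}: {event['message']}"
```

For example, `alert_title(tag_event({"service": "owl-service", "message": "high 5xx rate"}, "eks"))` produces a title that starts with `[EKS]`, while the rest of the alert keeps the format developers already know.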

Service Templates

Two Helm chart templates were created:

  • Web services with HTTP ingress
  • Worker services with KEDA-based queue scaling
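
Queue-based scaling for worker services roughly follows the rule "run enough replicas that each one handles at most a target number of queued messages". The sketch below is a simplified approximation of that behavior, not KEDA's implementation. The function name and the default replica bounds are hypothetical.

```python
import math


def desired_replicas(queue_length: int,
                     target_per_replica: int,
                     min_replicas: int = 0,    # hypothetical default
                     max_replicas: int = 20) -> int:  # hypothetical default
    """Replicas needed so each handles at most target_per_replica messages."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(queue_length / target_per_replica)
    # Clamp to the configured bounds so scaling never runs away.
    return max(min_replicas, min(max_replicas, raw))
```

With a minimum of zero, an empty queue scales the worker down to no replicas, which is one of the main attractions of queue-driven scaling for bursty workloads.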

Terraform continued to manage environment variables and AWS permissions, the latter granted through EKS Pod Identity.

Migrating Services: The Owl-Service Example

The migration process followed a structured approach:

  1. Terraform setup for AWS permissions and observability
  2. Argo CD manifests defining service configuration, scaling, and deployment strategy
  3. Service validation comparing metrics, traces, and responses against ECS baseline
  4. Canary testing with weighted DNS routing, gradually shifting traffic from 10% to 100%

The owl-service, a Python backend, was the second production service migrated. The team used DNS weighting for traffic control, allowing quick rollbacks when issues arose.
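
The DNS-weighted canary can be sketched as a small state machine: step the EKS weight upward while the service stays healthy, and drop it to zero to roll back. This is an illustration only. The intermediate steps in `CANARY_STEPS` are hypothetical (the talk gave only the 10% to 100% range), and `pick_backend` merely simulates how weighted records split traffic rather than calling any DNS API.

```python
import random

# Hypothetical schedule; the source only states a gradual 10% to 100% ramp.
CANARY_STEPS = (10, 25, 50, 100)


def next_weight(current: int, healthy: bool) -> int:
    """Advance the EKS traffic weight, or return 0 to roll back to ECS."""
    if not healthy:
        return 0
    for step in CANARY_STEPS:
        if step > current:
            return step
    return CANARY_STEPS[-1]  # already at full traffic


def pick_backend(eks_weight: int, rng: random.Random) -> str:
    """Simulate one weighted DNS answer; eks_weight is a percentage 0..100."""
    if not 0 <= eks_weight <= 100:
        raise ValueError("eks_weight must be between 0 and 100")
    return rng.choices(["eks", "ecs"],
                       weights=[eks_weight, 100 - eks_weight])[0]
```

Because rollback is just a weight change, reverting a bad canary is as fast as the DNS records can propagate, which matches the quick rollbacks described above.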

Challenges and Lessons

Recency Bias

Teams tended to blame new issues on Kubernetes, the most recent change, rather than on their actual root cause. Solution: enhanced observability to prove root causes and hands-on incident support to build trust.

Rate Limiting

Unexpected AWS service limits emerged during rollout:

  • EKS Pod Identity API throttling
  • AMP (Managed Prometheus) rate limits

Mitigation: Slow rollouts, early migration of large-scale services, and close AWS TAM collaboration.

Human Factors

  • VIP support with dedicated EKS team partners for each service
  • Training sessions and documentation updates
  • Flexible timelines allowing teams to migrate at their own pace

Current State and Future

As of the presentation, Duolingo had:

  • All new services automatically deployed to EKS
  • 10 most critical services migrated
  • ECS services kept running at 1% traffic for rollback capability
  • Automated migration workflows reducing manual effort

The migration remains ongoing, with plans for general adoption after early adopter validation.

Key Takeaways

  1. Strong user demand justifies migration costs
  2. Observability-first approach builds confidence in new platforms
  3. Gradual rollout with canary testing prevents large-scale failures
  4. Cellular architecture enables safe platform changes
  5. IPv6 adoption future-proofs infrastructure despite initial friction

The presentation concluded with a Q&A addressing technical reasons for migration (deployment strategies, Karpenter, Argo CD features) and observability approaches using OpenTelemetry collectors feeding into Honeycomb.
