Salesforce completed a large-scale migration of over 1,000 Amazon EKS clusters from Kubernetes Cluster Autoscaler to Karpenter, reducing scaling latency from minutes to seconds while cutting operational overhead by 80%.

Salesforce has successfully migrated its entire fleet of over 1,000 Amazon Elastic Kubernetes Service (EKS) clusters from the Kubernetes Cluster Autoscaler to Karpenter, the open-source node provisioning project originally developed by AWS. This architectural shift addresses critical scaling limitations inherent in traditional autoscaling approaches while delivering measurable improvements in infrastructure efficiency and developer velocity.
Technical Limitations Driving Change
The migration addressed three core limitations of Cluster Autoscaler:
- Scaling Latency: Cluster Autoscaler's dependency on Amazon EC2 Auto Scaling groups introduced provisioning delays of 3-5 minutes due to ASG initialization sequences.
- Availability Zone Imbalance: Static ASG configurations per AZ prevented efficient bin-packing across zones, resulting in uneven resource utilization.
- Operational Overhead: Maintaining thousands of node groups (often one per workload) created configuration sprawl, with each group requiring manual tuning of instance types, AMIs, and scaling policies. A representative definition is sketched below.
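The article does not reproduce Salesforce's actual node group definitions; the following hypothetical eksctl-style managed node group illustrates the kind of per-workload, per-AZ configuration sprawl that Cluster Autoscaler setups typically accumulate (all names and values are illustrative):

```yaml
# One of thousands of per-workload, per-AZ node group definitions,
# each tuned and rolled out individually. Names are hypothetical.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: payments-prod            # hypothetical cluster
  region: us-west-2
managedNodeGroups:
  - name: payments-m5-usw2a      # pinned to one workload and one AZ
    instanceType: m5.xlarge      # static instance type choice
    availabilityZones: ["us-west-2a"]
    minSize: 2
    maxSize: 20
    amiFamily: AmazonLinux2      # AMI updates rolled group by group
```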
Karpenter's Architectural Advantages
Karpenter's design fundamentally differs by bypassing Auto Scaling Groups entirely. Key technical differentiators:
- Direct EC2 API Integration: Provisions nodes via direct EC2 API calls, eliminating ASG mediation latency
- Real-Time Bin Packing: Evaluates pending pods against all available EC2 instance types (including GPU and ARM options) in real time
- Consolidation Algorithms: Continuously analyzes pod-to-node mappings to rightsize or replace underutilized nodes
- Provisioner CRDs: Replaces static ASGs with dynamic Provisioner configurations that support flexible constraints (see the sketch after this list)
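To make this concrete, here is a minimal Provisioner using Karpenter's v1alpha5 API, the generation that defined the Provisioner CRD (later releases rename it NodePool). The values are illustrative rather than Salesforce's actual configuration:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Constraints are requirements, not a fixed instance list: Karpenter
  # bin-packs pending pods against every EC2 instance type that matches.
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64", "arm64"]
    - key: topology.kubernetes.io/zone
      operator: In
      values: ["us-west-2a", "us-west-2b", "us-west-2c"]
  # Continuously rightsize or replace underutilized nodes.
  consolidation:
    enabled: true
  # Hard ceiling on the capacity this Provisioner may create.
  limits:
    resources:
      cpu: "1000"
  providerRef:
    name: default   # AWSNodeTemplate with AMI, subnet, and security group settings
```

A single manifest like this can stand in for dozens of statically configured ASGs, because instance type, architecture, and zone are expressed as constraints rather than fixed per group.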
Migration Mechanics and Tooling
The phased migration involved:
- Custom Transition Automation: Salesforce built an in-house migration controller that handled:
  - Gradual node rotation respecting Pod Disruption Budgets (an example PDB follows this list)
  - Automated AMI validation using golden image pipelines
  - Rollback capabilities via GitOps-managed configuration snapshots
- Workload Characterization: Analyzed pod scheduling requirements across 200+ microservice patterns to define optimal Provisioner constraints
- Failure Domain Handling: Developed workarounds for Kubernetes' 63-character label limit that previously blocked AZ-aware scheduling
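The rotation step leans on standard Kubernetes eviction semantics: a node drains only as fast as each workload's PodDisruptionBudget allows. A typical PDB gating that process looks like the following (workload names are hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb       # hypothetical workload
  namespace: checkout
spec:
  maxUnavailable: 1        # evictions proceed at most one pod at a time
  selector:
    matchLabels:
      app: checkout
```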
Performance Benchmarks
Post-migration metrics show significant improvements:
| Metric | Before (Cluster Autoscaler) | After (Karpenter) | Improvement |
|---|---|---|---|
| Scaling Latency | 180-300 seconds | 2-15 seconds | 94% reduction |
| Node Utilization | 55-65% | 72-78% | +17-23 percentage points |
| Operational Tasks | 120+ hours/week | <24 hours/week | 80% reduction |
| Node Groups | ~4,000 | 0 (replaced by Provisioners) | 100% reduction |
Cost savings are projected at 5% ($4.2M) in FY2026, with an additional 5-10% ($8-10M) expected in FY2027 through optimized spot instance usage and consolidation efficiency.
Implementation Challenges and Solutions
- PDB Constraints: 15% of clusters had misconfigured Pod Disruption Budgets that blocked node rotation. Solution: pre-migration validation checks and automated PDB annotation tooling.
- Stateful Workloads: Single-replica applications required custom disruption budgets to prevent simultaneous node replacements.
- Label Limitations: Developed a label compression scheme to encode AZ and instance metadata within Kubernetes' 63-character label limit (a hypothetical encoding is sketched below).
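Salesforce's exact encoding is not public; as a hypothetical sketch, a composite placement label can be kept under the 63-character limit on label values by substituting fixed abbreviations:

```yaml
# Hypothetical node labels; the real Salesforce scheme is not documented.
metadata:
  labels:
    # A naive composite value overflows the 63-character limit (68 chars):
    #   placement.example.com/key: us-west-2a--m5.4xlarge--spot--general-purpose-batch-tier-high-memory
    # A compressed encoding with fixed abbreviations fits comfortably:
    placement.example.com/key: usw2a.m5-4xl.sp.gp-batch-himem
```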
Broader Ecosystem Context
This migration aligns with patterns observed at Coinbase and BMW Group, where Karpenter replaced Cluster Autoscaler to handle bursty, heterogeneous workloads. Common themes:
- Elimination of ASG management overhead
- Improved spot instance integration (up to 70% cost reduction)
- Sub-30-second scale-out for AI/ML workloads
What distinguishes Salesforce's implementation is its fleet-scale automation: a custom toolchain enabled consistent migration across 1,000+ clusters while maintaining strict SLO compliance. The architecture now supports developer self-service through:
- Git-managed Provisioner configurations
- Namespace-scoped resource constraints (sketched below)
- Automated capacity validation in CI pipelines
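The article does not detail the self-service mechanics. One conventional way to express namespace-scoped constraints is a standard ResourceQuota versioned in the same Git repository as the Provisioner configurations (namespace and values are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-compute-quota
  namespace: team-payments   # hypothetical team namespace
spec:
  hard:
    requests.cpu: "200"      # caps what the team can schedule,
    requests.memory: 400Gi   # independent of fleet-level Provisioner limits
```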
Operational Best Practices
Salesforce's experience yields concrete recommendations:
- Implement pre-flight checks for PDB configurations before migration
- Establish per-workload disruption budgets so that Karpenter's drift and consolidation mechanisms can replace nodes safely
- Standardize Provisioner configurations using hierarchical inheritance
- Monitor consolidation metrics via Prometheus integration (see the sketch below)
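As a sketch of the Prometheus integration, a Prometheus Operator ServiceMonitor can scrape Karpenter's controller metrics. The selector labels and port name below match recent Karpenter Helm chart defaults, but should be verified against the actual deployment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter   # service labels set by the Helm chart
  endpoints:
    - port: http-metrics                  # metrics port name in the chart
      interval: 30s
```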
For organizations managing large Kubernetes fleets, this migration demonstrates that replacing Cluster Autoscaler with Karpenter can yield order-of-magnitude improvements in responsiveness while significantly reducing operational burden. The transition requires careful workload analysis and automation tooling—particularly for stateful services—but delivers substantial ROI through accelerated scaling and optimized resource utilization.
