Pinterest Engineers Eliminate CPU Zombies to Resolve Production Bottlenecks

Pinterest’s platform team traced intermittent CPU starvation on its PinCompute Kubernetes platform to leaked memory cgroups left by a crash‑looping ECS agent. By disabling the unused agent and purging ~70 000 zombie cgroups, they eliminated ENA driver resets, restored network stability, and improved ML job success rates.

Pinterest has published a detailed post‑mortem describing how its platform team diagnosed and fixed a subtle CPU starvation problem that was crashing large‑scale machine‑learning (ML) training jobs on PinCompute, the company’s Kubernetes‑based compute platform.
Technical Announcement
During a routine reliability review the team observed a 25 % drop in training‑job success rates for several workloads. The failures manifested as intermittent network errors and Elastic Network Adapter (ENA) device resets, which in turn caused Ray clusters to abort. Initial dashboards showed overall CPU utilisation well within limits, masking the underlying issue.
Deep‑Dive Specifications
| Component | Normal | Observed Anomaly |
|---|---|---|
| kubelet CPU usage | < 1 % | Spikes to ~6.5 % on a single core |
| ENA driver NAPI poll thread | < 2 % | Starved for cycles during spikes |
| Memory cgroup count | ~240 active | ~70 000 zombie cgroups |
| Ray clusters provisioned monthly | 10 k‑70 k | No change, but success rate fell |
Symptom Isolation
- Per‑core monitoring – The team switched from aggregate metrics to `mpstat -P ALL` and discovered that individual cores were hitting 100 % system CPU for brief intervals.
- Perf captures – Rolling two‑minute `perf record` sessions were collected over a 12‑hour reproduction window. The captures were visualised with Netflix’s Flamescope, allowing the engineers to zoom into the exact timestamps of ENA resets (a capture sketch follows this list).
- Kernel hotspot – The flame graphs highlighted the function `mem_cgroup_nr_lru_pages` consuming the bulk of the CPU time during spikes.
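A minimal shell sketch of that capture loop, assuming a Linux host with sysstat and perf installed; the file names and the two‑minute window are illustrative, not Pinterest’s exact tooling:

```bash
# Watch per-core utilisation live; aggregate averages hide single-core saturation.
mpstat -P ALL 1

# Roll two-minute whole-system captures (99 Hz, all CPUs, call graphs) so that
# spikes can later be matched against ENA reset timestamps.
while true; do
  ts=$(date +%Y%m%dT%H%M%S)
  perf record -F 99 -a -g -o "perf-${ts}.data" -- sleep 120
  # Flamescope ingests the text output of `perf script`.
  perf script -i "perf-${ts}.data" > "perf-${ts}.txt"
done
```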
Root Cause Analysis
The culprit turned out to be an Amazon ECS agent baked into the AWS Deep Learning AMI used for PinCompute nodes. The agent was enabled by default, never used by Pinterest, and was stuck in a crash‑loop. Each restart leaked a memory cgroup (memcg). Over time the node accumulated ≈70 000 zombie memcgs while only ~240 were actively used.
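A quick way to confirm that a unit is crash‑looping, assuming it is named ecs.service as in the resolution steps below (the NRestarts property requires a reasonably recent systemd):

```bash
# How many times systemd has restarted the unit since boot.
systemctl show ecs.service -p NRestarts

# The unit's recent log lines usually make a crash loop obvious.
journalctl -u ecs.service --since "1 hour ago" | tail -n 50
```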
The kubelet’s periodic cgroup statistics sync walks the entire list of memcgs. With the inflated list, a single core spent seconds walking the data structure, starving the ENA driver’s NAPI poll thread. When the driver could not process Tx completions within five seconds, the ENA hardware reset logic kicked in, dropping packets and crashing the Ray jobs.
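The scale of such a leak can be inspected directly, although the exact commands depend on the cgroup version in use; the paths below are the standard mount points, and the sample counts simply echo the figures above:

```bash
# cgroup v1: each directory under the memory controller is one live memcg.
find /sys/fs/cgroup/memory -type d | wc -l

# cgroup v2: the kernel reports dying (zombie) descendants directly.
cat /sys/fs/cgroup/cgroup.stat
# nr_descendants 240
# nr_dying_descendants 70000    <- leaked memcgs still pinned by the kernel
```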
Resolution Steps
| Step | Action | Rationale |
|---|---|---|
| 1 | Disable the ECS agent systemd unit in the base image (`systemctl disable ecs.service`) | Prevents the crash‑loop and further memcg leaks |
| 2 | Reboot all affected nodes to purge existing zombie cgroups | Clears the inflated cgroup list, restoring normal kubelet performance |
| 3 | Add a health check to the node‑image build pipeline that verifies the memcg count stays below a safe threshold (e.g., 1 000) | Catches future regressions early |
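A minimal sketch of the step‑3 health check, assuming cgroup v1 at the standard mount point; the script name and environment variable are illustrative:

```bash
#!/usr/bin/env bash
# check-memcg-count.sh - fail the build (or a node health probe) on a memcg explosion.
set -euo pipefail

THRESHOLD="${MEMCG_THRESHOLD:-1000}"
COUNT=$(find /sys/fs/cgroup/memory -type d | wc -l)

if [ "$COUNT" -gt "$THRESHOLD" ]; then
  echo "FAIL: ${COUNT} memory cgroups exceeds threshold ${THRESHOLD}; possible leak" >&2
  exit 1
fi
echo "OK: ${COUNT} memory cgroups"
```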
Since the rollout, CPU usage on the affected core has returned to < 1 %, ENA resets have stopped, and the ML job success rate has recovered to baseline levels.
Real‑World Implications
Observability Lessons
- Aggregate metrics can hide per‑core contention. The team had to drop down to per‑core `mpstat` and `perf` to see the problem.
- Temporal profiling is essential at scale. Continuous, indexed profiling tools (e.g., Intel’s gProfiler, eBPF‑based Parca, or Grafana Pyroscope) would have surfaced the memcg‑related spikes automatically, reducing mean time to resolution.
Platform‑Engineering Takeaways
- Base‑image hygiene matters. Even a rarely used daemon can introduce kernel‑level state that scales poorly.
- Cgroup explosion is a realistic failure mode. Components that enumerate large numbers of cgroups (the kubelet, systemd’s cgroup manager) can become CPU‑bound if the list grows unchecked.
- Cross‑layer visibility is critical. The bug spanned user‑space (ECS agent), container runtime (kubelet), and kernel networking (ENA driver). Without tools that bridge those layers, root cause analysis becomes a needle‑in‑haystack problem.
Recommendations for Similar Environments
- Audit all enabled services in your node images; disable anything not required for your workload.
- Instrument cgroup counts and set alerts when they exceed a small multiple of the expected active set.
- Deploy a fleet‑wide continuous profiling solution that captures per‑process CPU stacks at a low overhead (e.g., 1 % sampling rate).
- Incorporate kernel‑level eBPF probes that monitor NAPI poll latency; sudden spikes can indicate driver starvation before a reset occurs (see the sketch after this list).
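As a starting point for that last item, a bpftrace one‑liner that histograms the gap between successive NAPI polls per device; this assumes bpftrace is installed and the kernel exposes the napi:napi_poll tracepoint. Sustained multi‑second gaps are exactly the starvation signature described above:

```bash
sudo bpftrace -e '
tracepoint:napi:napi_poll
{
  $dev = str(args->dev_name);
  if (@last[$dev] != 0) {
    // Milliseconds since this device was last polled; long gaps = starvation.
    @poll_gap_ms[$dev] = hist((nsecs - @last[$dev]) / 1000000);
  }
  @last[$dev] = nsecs;
}'
```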
Author Bio
Mark Silvester is a Platform and Architecture Manager at Griffiths Waite, a software consultancy based in Birmingham, UK. He focuses on cloud‑native platform strategy, DevOps practices, and the practical application of AI in engineering.
For further reading, see the original Pinterest engineering post on their blog and the related discussion on InfoQ.
