Pinterest Engineers Eliminate CPU Zombies to Resolve Production Bottlenecks

Pinterest’s platform team traced intermittent CPU starvation on its PinCompute Kubernetes platform to leaked memory cgroups left by a crash‑looping ECS agent. By disabling the unused agent and purging ~70 000 zombie cgroups, they eliminated ENA driver resets, restored network stability, and improved ML job success rates.

Pinterest has published a detailed post‑mortem describing how its platform team diagnosed and fixed a subtle CPU starvation problem that was crashing large‑scale machine‑learning (ML) training jobs on PinCompute, the company’s Kubernetes‑based compute platform.
Technical Announcement
During a routine reliability review the team observed a 25 % drop in training‑job success rates for several workloads. The failures manifested as intermittent network errors and Elastic Network Adapter (ENA) device resets, which in turn caused Ray clusters to abort. Initial dashboards showed overall CPU utilisation well within limits, masking the underlying issue.
Deep‑Dive Specifications
| Component | Normal | Observed Anomaly |
|---|---|---|
| kubelet CPU usage | < 1 % | Spikes to ~6.5 % on a single core |
| ENA driver NAPI poll thread | < 2 % | Starved for cycles during spikes |
| Memory cgroup count | ~240 active | ~70 000 zombie cgroups |
| Ray clusters provisioned monthly | 10 k‑70 k | No change, but success rate fell |
Symptom Isolation
- Per‑core monitoring – The team switched from aggregate metrics to `mpstat -P ALL` and discovered that individual cores were hitting 100 % system CPU for brief intervals.
- Perf captures – Rolling two‑minute `perf record` sessions were collected over a 12‑hour reproduction window. The captures were visualised with Netflix’s Flamescope, allowing the engineers to zoom into the exact timestamps of ENA resets (a capture sketch follows this list).
- Kernel hotspot – The flame graphs highlighted the function `mem_cgroup_nr_lru_pages` consuming the bulk of the CPU time during spikes.
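A minimal shell sketch of that capture loop, assuming a Linux host with sysstat and perf installed; the file names and the two‑minute window are illustrative, not Pinterest’s exact tooling:

```bash
# Watch per-core utilisation live; aggregate averages hide single-core saturation.
mpstat -P ALL 1

# Roll two-minute whole-system captures (99 Hz, all CPUs, call graphs) so that
# spikes can later be matched against ENA reset timestamps.
while true; do
  ts=$(date +%Y%m%dT%H%M%S)
  perf record -F 99 -a -g -o "perf-${ts}.data" -- sleep 120
  # Flamescope ingests the text output of `perf script`.
  perf script -i "perf-${ts}.data" > "perf-${ts}.txt"
done
```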
Root Cause Analysis
The culprit turned out to be an Amazon ECS agent baked into the AWS Deep Learning AMI used for PinCompute nodes. The agent was enabled by default, never used by Pinterest, and was stuck in a crash‑loop. Each restart leaked a memory cgroup (memcg). Over time the node accumulated ≈70 000 zombie memcgs while only ~240 were actively used.
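A quick way to confirm that a unit is crash‑looping, assuming it is named ecs.service as in the resolution steps below (the NRestarts property requires a reasonably recent systemd):

```bash
# How many times systemd has restarted the unit since boot.
systemctl show ecs.service -p NRestarts

# The unit's recent log lines usually make a crash loop obvious.
journalctl -u ecs.service --since "1 hour ago" | tail -n 50
```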
The kubelet’s periodic cgroup statistics sync walks the entire list of memcgs. With the inflated list, a single core spent seconds walking the data structure, starving the ENA driver’s NAPI poll thread. When the driver could not process Tx completions within five seconds, the ENA hardware reset logic kicked in, dropping packets and crashing the Ray jobs.
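The scale of such a leak can be inspected directly, although the exact commands depend on the cgroup version in use; the paths below are the standard mount points, and the sample counts simply echo the figures above:

```bash
# cgroup v1: each directory under the memory controller is one live memcg.
find /sys/fs/cgroup/memory -type d | wc -l

# cgroup v2: the kernel reports dying (zombie) descendants directly.
cat /sys/fs/cgroup/cgroup.stat
# nr_descendants 240
# nr_dying_descendants 70000    <- leaked memcgs still pinned by the kernel
```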
Resolution Steps
| Step | Action | Rationale |
|---|---|---|
| 1 | Disable the ECS agent systemd unit in the base image (`systemctl disable ecs.service`) | Prevents the crash‑loop and further memcg leaks |
| 2 | Reboot all affected nodes to purge existing zombie cgroups | Clears the inflated cgroup list, restoring normal kubelet performance |
| 3 | Add a health check to the node‑image build pipeline that verifies the memcg count stays below a safe threshold (e.g., 1 000) | Catches future regressions early |
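A minimal sketch of the step‑3 health check, assuming cgroup v1 at the standard mount point; the script name and environment variable are illustrative:

```bash
#!/usr/bin/env bash
# check-memcg-count.sh - fail the build (or a node health probe) on a memcg explosion.
set -euo pipefail

THRESHOLD="${MEMCG_THRESHOLD:-1000}"
COUNT=$(find /sys/fs/cgroup/memory -type d | wc -l)

if [ "$COUNT" -gt "$THRESHOLD" ]; then
  echo "FAIL: ${COUNT} memory cgroups exceeds threshold ${THRESHOLD}; possible leak" >&2
  exit 1
fi
echo "OK: ${COUNT} memory cgroups"
```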
Since the rollout, CPU usage on the affected core has returned to < 1 %, ENA resets have stopped, and the ML job success rate has recovered to baseline levels.
Real‑World Implications
Observability Lessons
- Aggregate metrics can hide per‑core contention. The team had to drop down to per‑core `mpstat` and `perf` to see the problem.
- Temporal profiling is essential at scale. Continuous, indexed profiling tools (e.g., Intel’s gProfiler, eBPF‑based Parca, or Grafana Pyroscope) would have surfaced the memcg‑related spikes automatically, reducing mean time to resolution.
Platform‑Engineering Takeaways
- Base‑image hygiene matters. Even a rarely used daemon can introduce kernel‑level state that scales poorly.
- Cgroup explosion is a realistic failure mode. Components that enumerate large numbers of cgroups (the kubelet, systemd’s cgroup manager) can become CPU‑bound if the list grows unchecked.
- Cross‑layer visibility is critical. The bug spanned user‑space (ECS agent), container runtime (kubelet), and kernel networking (ENA driver). Without tools that bridge those layers, root cause analysis becomes a needle‑in‑haystack problem.
Recommendations for Similar Environments
- Audit all enabled services in your node images; disable anything not required for your workload.
- Instrument cgroup counts and set alerts when they exceed a small multiple of the expected active set.
- Deploy a fleet‑wide continuous profiling solution that captures per‑process CPU stacks at a low overhead (e.g., 1 % sampling rate).
- Incorporate kernel‑level eBPF probes that monitor NAPI poll latency; sudden spikes can indicate driver starvation before a reset occurs (see the sketch after this list).
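As a starting point for that last item, a bpftrace one‑liner that histograms the gap between successive NAPI polls per device; this assumes bpftrace is installed and the kernel exposes the napi:napi_poll tracepoint. Sustained multi‑second gaps are exactly the starvation signature described above:

```bash
sudo bpftrace -e '
tracepoint:napi:napi_poll
{
  $dev = str(args->dev_name);
  if (@last[$dev] != 0) {
    // Milliseconds since this device was last polled; long gaps = starvation.
    @poll_gap_ms[$dev] = hist((nsecs - @last[$dev]) / 1000000);
  }
  @last[$dev] = nsecs;
}'
```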
Author Bio
Mark Silvester is a Platform and Architecture Manager at Griffiths Waite, a software consultancy based in Birmingham, UK. He focuses on cloud‑native platform strategy, DevOps practices, and the practical application of AI in engineering.
For further reading, see the original Pinterest engineering post on their blog and the related discussion on InfoQ.
