LinkedIn engineers used eBPF off‑CPU profiling to capture a fleeting kernel lock contention caused by a massive Rust HashMap resize. Pre‑allocating the map removed the 3.5 GB allocation behind the mmap_lock stall that intermittently froze the feed database.
Technical announcement
LinkedIn’s feed service experienced a series of 10‑15 second outages that left no trace in application logs. The incidents manifested as a sudden loss of database availability, a brief spike in memory allocation, and an immediate return to normal operation at a higher memory baseline. After conventional metrics proved inconclusive, the team deployed an automated eBPF off‑CPU profiler that captured kernel stack traces at the exact moment of each freeze. The profiling revealed a kernel‑level lock contention on the mmap_lock semaphore, triggered by a 3.5 GB memory allocation during a Rust HashMap resize.

Specifications
| Component | Detail |
|---|---|
| Profiler | bcc‑based offcputime.py script, sampling interval 1 ms, capturing full kernel stack for blocked threads |
| Trigger | Custom health check on feed DB latency; when latency > 200 ms for > 2 s, script starts a 15 s profiling window |
| Lock observed | mmap_lock (write mode), held for the duration of the ~3.5 GB allocation and blocking all threads that touch virtual memory (page faults, madvise, mmap) |
| Root cause | Rust HashMap<pkey_vs_docref> exceeded 58,720,256 entries, causing a resize that doubled the map size and allocated ~3.5 GB of virtual address space |
| Mitigation | Pre‑allocate the hash map to the maximum expected size at service start‑up, increasing resident memory by ~3 GB |
| Performance impact | No measurable increase in request latency; memory footprint grew from ~12 GB to ~15 GB, well within host limits |
| Deployment | Updated Docker image with pre‑allocation logic; rollout via canary to 5 % of pods, full rollout after 30 min of stable operation |
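
The 58,720,256 threshold in the root-cause row is consistent with a table of 2^26 buckets at a 7/8 maximum load factor, which is how the hashbrown table behind Rust's standard HashMap sizes large tables. The sketch below only checks that arithmetic. The per-bucket figure assumes "3.5 GB" means 3.5 GiB and is a rough estimate, not a number from the incident report.

```python
# Assumption: power-of-two bucket counts with a 7/8 maximum load factor,
# as in the hashbrown table behind Rust's std HashMap.
def max_entries(buckets: int) -> int:
    """Entries a table with `buckets` slots can hold before it must grow."""
    return buckets // 8 * 7

buckets_before = 2 ** 26             # 67,108,864 buckets
buckets_after = buckets_before * 2   # a resize doubles the bucket count

threshold = max_entries(buckets_before)

# Assumption: "3.5 GB" is read as 3.5 GiB; the real figure may differ.
allocation_bytes = int(3.5 * 2 ** 30)
bytes_per_bucket = allocation_bytes / buckets_after

print(threshold)         # 58720256
print(bytes_per_bucket)  # 28.0
```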
Off‑CPU profiling workflow
- Continuous health monitor – a lightweight goroutine polls the feed DB latency metric every 500 ms.
- Freeze detection – when the latency threshold is breached, the monitor invokes a shell wrapper.
- eBPF activation – the wrapper launches `sudo /usr/share/bcc/tools/offcputime.py -p <pid> -T 15 -o /tmp/profile-${TS}.txt`.
- Stack aggregation – after the window closes, the script aggregates kernel stacks and highlights the most frequent call sites.
- Alerting – the aggregated profile is uploaded to the internal observability platform for triage.
Because the trigger fires about two seconds into a freeze that lasts 10 to 15 seconds, the profiler attaches while threads are still blocked, avoiding the classic “too late” problem of post‑mortem analysis.
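The freeze-detection step can be sketched as a small state machine. This is a minimal illustration, not LinkedIn's monitor: the class name `FreezeDetector`, its method, and its fields are hypothetical, and only the 200 ms, 2 s, and 15 s values come from the Trigger row above. In a real deployment, a `True` result would be the point at which the shell wrapper is invoked.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FreezeDetector:
    """Fires once when latency stays above a threshold for a sustained period."""
    threshold_ms: float = 200.0   # latency threshold from the Trigger row
    sustain_s: float = 2.0        # how long the breach must last
    window_s: float = 15.0        # profiling window length
    _breach_start: Optional[float] = None
    _quiet_until: float = float("-inf")

    def observe(self, latency_ms: float, now: float) -> bool:
        """Record one sample; return True when a profiling window should start."""
        if latency_ms <= self.threshold_ms:
            self._breach_start = None      # breach ended, reset the timer
            return False
        if self._breach_start is None:
            self._breach_start = now       # first sample above the threshold
        sustained = now - self._breach_start > self.sustain_s
        if sustained and now >= self._quiet_until:
            # Suppress re-triggering while the current window is still running.
            self._quiet_until = now + self.window_s
            return True
        return False

# Simulated samples every 500 ms: (timestamp in seconds, latency in ms).
samples = [(0.0, 20), (0.5, 900), (1.0, 900), (1.5, 900),
           (2.0, 900), (2.5, 900), (3.0, 900), (3.5, 900)]
detector = FreezeDetector()
fired = [t for t, ms in samples if detector.observe(ms, t)]
print(fired)  # [3.0]
```

The detector fires once, at the first sample taken more than two seconds after the breach began, and stays quiet for the rest of the profiling window.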
Real‑world implications
1. Memory‑heavy data structures in latency‑sensitive services
Pre‑allocating large containers eliminates the need for runtime resizing, which can trigger heavyweight kernel operations. In this case, a single HashMap growth caused the entire process to block because the kernel must serialize all address‑space modifications under mmap_lock. Teams building high‑throughput services should audit any container that can grow beyond a few million entries and consider static allocation or tiered sharding.
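To make the effect of pre-allocation concrete, the following sketch models a table that doubles its bucket count whenever it passes a 7/8 load factor and counts the resizes needed to reach a target size. It is a simplified model under that assumption, not Rust's implementation, and the 60,000,000 target is a hypothetical peak chosen to sit just past the reported threshold. In Rust itself, pre-sizing is what `HashMap::with_capacity` or `reserve` provides.

```python
def resizes_during_fill(entries: int, initial_buckets: int = 8) -> int:
    """Count doublings needed to hold `entries` at a 7/8 maximum load factor."""
    buckets = initial_buckets
    resizes = 0
    while buckets // 8 * 7 < entries:
        buckets *= 2
        resizes += 1
    return resizes

target = 60_000_000  # hypothetical peak entry count

grown = resizes_during_fill(target)                              # grows on demand
presized = resizes_during_fill(target, initial_buckets=2 ** 27)  # sized at start-up

print(grown)     # 24
print(presized)  # 0
```

Each on-demand resize is larger than the one before it, so the last few doublings are the ones that request gigabytes from the kernel while the service is handling traffic. Sizing the table at start-up moves that single large allocation to a point where no requests are in flight.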
2. eBPF as a first‑line diagnostic for “silent” freezes
Traditional APM tools focus on CPU‑time and I/O metrics; they rarely capture threads that are blocked inside the kernel. Off‑CPU profiling fills that gap by exposing where the scheduler is parking tasks. The LinkedIn case demonstrates that a short‑lived lock can be invisible to CPU utilization charts yet catastrophic for request latency.
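As an illustration of how an off-CPU profile exposes a blocking site, the sketch below sums blocked time per frame from folded-stack text, the one-line-per-stack format that bcc's offcputime can emit for flame graphs. The sample stacks, the `feed-db` process name, and the microsecond values are invented for the example; they are not data from the incident.

```python
from collections import Counter

# Invented sample in folded format: "frame;frame;... microseconds".
SAMPLE = """\
feed-db;do_user_addr_fault;down_read;rwsem_down_read_slowpath;schedule 9200000
feed-db;do_user_addr_fault;down_read;rwsem_down_read_slowpath;schedule 8700000
feed-db;__x64_sys_epoll_wait;ep_poll;schedule 400000
"""

def off_cpu_by_frame(folded: str) -> Counter:
    """Total off-CPU microseconds attributed to each frame (inclusive)."""
    totals = Counter()
    for line in folded.strip().splitlines():
        stack, _, value = line.rpartition(" ")
        for frame in set(stack.split(";")):   # count each frame once per stack
            totals[frame] += int(value)
    return totals

totals = off_cpu_by_frame(SAMPLE)

# Share of all blocked time spent waiting on a read-write semaphore.
lock_share = totals["rwsem_down_read_slowpath"] / totals["schedule"]
print(f"{lock_share:.1%}")
```

In this invented sample nearly all blocked time sits under the semaphore slow path reached from the page-fault handler, the kind of signature that a CPU utilization chart would not show.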
3. Automated instrumentation on failure conditions
Embedding a trigger‑based profiler in production reduces the need for manual debugging sessions. The script runs with minimal overhead (≈ 0.2 % CPU) and only activates when the health check flags an anomaly. This pattern can be generalized: any metric that indicates a symptom (e.g., sudden latency spike, error burst) can launch a targeted eBPF trace, capturing the exact kernel state at the moment of failure.
4. Trade‑offs of increased resident memory
The fix added ~3 GB of RAM consumption per service instance. For LinkedIn’s fleet, this translated to an additional ~150 TB of memory across all pods. The engineering team evaluated the cost against the SLA impact of the freezes and concluded that the memory overhead was acceptable. Organizations must weigh similar trade‑offs, especially in environments where memory is a scarce resource.
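The two figures in this paragraph imply a fleet size, which gives a quick way to sanity-check similar estimates. The calculation assumes decimal units (1 TB = 1000 GB), which the article does not specify, so the result is an order-of-magnitude estimate only.

```python
extra_per_instance_gb = 3   # per-instance increase, from the article
fleet_extra_tb = 150        # fleet-wide increase, from the article

# Assumption: decimal units (1 TB = 1000 GB).
implied_instances = fleet_extra_tb * 1000 / extra_per_instance_gb
print(implied_instances)  # 50000.0
```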
Takeaways for infrastructure teams
- Audit large containers: Identify any data structures that can trigger massive allocations; benchmark their growth patterns under realistic loads.
- Deploy eBPF probes proactively: Use BCC or libbpf to embed low‑overhead tracers that can be toggled on demand.
- Correlate memory spikes with kernel locks: Tools like `perf lock` or `bpftrace` can surface contention on global semaphores such as `mmap_lock`.
- Plan for memory budgeting: Pre‑allocation may increase baseline memory usage; ensure capacity planning accounts for worst‑case footprints.
The LinkedIn incident underscores that even well‑instrumented services can suffer from “invisible” kernel stalls. By marrying health‑check‑driven automation with eBPF off‑CPU profiling, the team turned a fleeting, hard‑to‑detect freeze into a concrete, fixable problem.

Author: Sergio De Simone – senior software engineer with extensive experience in systems programming and observability tooling.
