LinkedIn Traces Kernel mmap_lock Contention Behind 10‑Second Feed Outages
#Infrastructure

Infrastructure Reporter

LinkedIn engineers used eBPF off-CPU profiling to capture fleeting kernel lock contention caused by a massive Rust HashMap resize. Pre-allocating the data structure eliminated the mmap_lock stall, triggered by a 3.5 GB allocation, that intermittently froze the feed database.

Technical announcement

LinkedIn’s feed service experienced a series of 10-15 second outages that left no trace in application logs. Each incident showed up as a sudden loss of database availability and a brief spike in memory allocation, followed by an immediate return to normal operation at a higher memory baseline. After conventional metrics proved inconclusive, the team deployed an automated eBPF off-CPU profiler that captured kernel stack traces while each freeze was in progress. The profiles revealed kernel-level contention on the mmap_lock semaphore, triggered by a 3.5 GB memory allocation during a Rust HashMap resize.


Specifications

Component | Detail
Profiler | bcc‑based offcputime.py script, sampling interval 1 ms, capturing full kernel stack for blocked threads
Trigger | Custom health check on feed DB latency; when latency > 200 ms for > 2 s, script starts a 15 s profiling window
Lock observed | mmap_lock (write mode) held for ~3.5 GB allocation, blocking all threads that touch virtual memory (page faults, madvise, mmap)
Root cause | Rust HashMap<pkey_vs_docref> exceeded 58,720,256 entries, causing a resize that doubled the map size and allocated ~3.5 GB of virtual address space
Mitigation | Pre‑allocate the hash map to the maximum expected size at service start‑up, increasing resident memory by ~3 GB
Performance impact | No measurable increase in request latency; memory footprint grew from ~12 GB to ~15 GB, well within host limits
Deployment | Updated Docker image with pre‑allocation logic; rollout via canary to 5 % of pods, full rollout after 30 min of stable operation
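
The 58,720,256-entry figure is consistent with how Rust's standard HashMap sizes its table. The map is built on hashbrown, which uses a power-of-two bucket count and grows once the table is 7/8 full. The sketch below reproduces that arithmetic. The 7/8 load factor is an implementation detail of current standard library versions rather than a documented guarantee, and resize_threshold is a hypothetical helper written for this illustration.

```rust
/// Maximum number of entries a table with `buckets` slots can hold before
/// it must grow, ASSUMING a 7/8 maximum load factor (a hashbrown
/// implementation detail, not part of the documented std API).
fn resize_threshold(buckets: u64) -> u64 {
    buckets / 8 * 7
}

fn main() {
    let buckets: u64 = 1 << 26; // 67,108,864 slots
    let threshold = resize_threshold(buckets);
    println!("{} buckets hold at most {} entries", buckets, threshold);
    // One more insert forces the table to double.
    println!("next table size: {} buckets", buckets * 2);
}
```

Under that assumption, a table with 2^26 buckets reaches its limit at exactly 58,720,256 entries, and the next insert doubles it to 2^27 buckets. How many bytes the new table needs depends on the per-entry size, which the article does not state.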

Off‑CPU profiling workflow

  1. Continuous health monitor – a lightweight goroutine polls the feed DB latency metric every 500 ms.
  2. Freeze detection – when the latency threshold is breached, the monitor invokes a shell wrapper.
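  3. eBPF activation – the wrapper launches sudo /usr/share/bcc/tools/offcputime.py -p <pid> 15 > /tmp/profile-${TS}.txt (offcputime takes the duration in seconds as a positional argument and writes its report to standard output).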
  4. Stack aggregation – after the window closes, the script aggregates kernel stacks and highlights the most frequent call sites.
  5. Alerting – the aggregated profile is uploaded to the internal observability platform for triage.

Because the trigger fires roughly two seconds into a 10-15 second freeze, the profiler attaches while the stall is still in progress, avoiding the classic “too late” problem of post-mortem analysis.
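
The trigger condition (latency above 200 ms for more than 2 s, polled every 500 ms) can be sketched as a small debounce state machine. This is a minimal illustration, written in Rust rather than the Go the article mentions. FreezeDetector and its fields are hypothetical names, and the latency samples are simulated rather than taken from the incident.

```rust
use std::time::Duration;

/// Debounced trigger: fires once latency has stayed above `threshold`
/// for longer than `hold`. All names here are hypothetical.
struct FreezeDetector {
    threshold: Duration,
    hold: Duration,
    poll_interval: Duration,
    breached_for: Duration,
    fired: bool,
}

impl FreezeDetector {
    fn new(threshold: Duration, hold: Duration, poll_interval: Duration) -> Self {
        Self { threshold, hold, poll_interval, breached_for: Duration::ZERO, fired: false }
    }

    /// Feed one latency sample. Returns true at most once per breach
    /// episode, at the poll where the profiler should be launched.
    fn observe(&mut self, latency: Duration) -> bool {
        if latency <= self.threshold {
            // Healthy sample: reset the episode.
            self.breached_for = Duration::ZERO;
            self.fired = false;
            return false;
        }
        // Each breaching poll is counted as one full polling interval.
        self.breached_for += self.poll_interval;
        if self.breached_for > self.hold && !self.fired {
            self.fired = true;
            return true;
        }
        false
    }
}

fn main() {
    let mut detector = FreezeDetector::new(
        Duration::from_millis(200), // latency threshold
        Duration::from_secs(2),     // breach must last longer than this
        Duration::from_millis(500), // polling interval
    );
    // Simulated samples in milliseconds: two healthy polls, then a stall.
    let samples_ms: [u64; 8] = [20, 35, 900, 1200, 1500, 1800, 2500, 3000];
    for (i, ms) in samples_ms.iter().enumerate() {
        if detector.observe(Duration::from_millis(*ms)) {
            println!("poll {}: start 15 s off-CPU profiling window", i);
        }
    }
}
```

Resetting on the first healthy sample keeps short latency blips from launching the profiler, and the fired flag prevents a second launch while the same freeze is still under way.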


Real‑world implications

1. Memory‑heavy data structures in latency‑sensitive services

Pre-allocating large containers eliminates the need for runtime resizing, which can trigger heavyweight kernel operations. In this case, a single HashMap resize stalled the entire process: the kernel serializes address-space modifications under the per-process mmap_lock, so while the allocation held the lock in write mode, every thread that page-faulted or called mmap or madvise had to wait. Teams building high-throughput services should audit any container that can grow beyond a few million entries and consider static allocation or tiered sharding.
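
A minimal sketch of the pre-allocation pattern, using the standard library's HashMap::with_capacity. The key and value types, the build_index function, and the MAX_EXPECTED_ENTRIES constant are placeholders; the article does not describe the actual types behind pkey_vs_docref or the size LinkedIn chose. The bound is kept small here so the example runs quickly.

```rust
use std::collections::HashMap;

/// Hypothetical upper bound on entries. A real service would derive this
/// from capacity planning; it is small here so the example runs quickly.
const MAX_EXPECTED_ENTRIES: usize = 100_000;

fn build_index() -> HashMap<u64, u64> {
    // One up-front allocation sized for the worst case, made at start-up
    // before the service begins taking traffic.
    HashMap::with_capacity(MAX_EXPECTED_ENTRIES)
}

fn main() {
    let mut index = build_index();
    let initial_capacity = index.capacity();

    for key in 0..MAX_EXPECTED_ENTRIES as u64 {
        index.insert(key, key * 2);
    }

    // with_capacity guarantees room for at least the requested number of
    // elements without reallocating, so the capacity has not changed.
    assert_eq!(index.capacity(), initial_capacity);
    println!("entries: {}, capacity: {}", index.len(), index.capacity());
}
```

The guarantee only holds up to the requested capacity. If the map ever grows past MAX_EXPECTED_ENTRIES it will resize as before, so the bound needs headroom and ideally an alert as the entry count approaches it.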

2. eBPF as a first‑line diagnostic for “silent” freezes

Traditional APM tools focus on CPU‑time and I/O metrics; they rarely capture threads that are blocked inside the kernel. Off‑CPU profiling fills that gap by exposing where the scheduler is parking tasks. The LinkedIn case demonstrates that a short‑lived lock can be invisible to CPU utilization charts yet catastrophic for request latency.
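
To make the idea concrete, the sketch below totals blocked time across stacks that pass through a given kernel frame. It assumes the folded output format of offcputime's -f flag (semicolon-separated frames, root first, followed by a value in microseconds). The sample lines are invented for illustration and are not output captured from the incident; blocked_time_through is a hypothetical helper.

```rust
/// Illustrative folded stacks (NOT real captured data): root-to-leaf
/// frames separated by ';', then blocked time in microseconds.
const SAMPLE: &str = "\
feed-db;entry_SYSCALL_64_after_hwframe;do_syscall_64;__x64_sys_madvise;down_read;rwsem_down_read_slowpath;schedule 4200000
feed-db;asm_exc_page_fault;exc_page_fault;do_user_addr_fault;down_read;rwsem_down_read_slowpath;schedule 3900000
feed-db;entry_SYSCALL_64_after_hwframe;do_syscall_64;__x64_sys_futex;futex_wait;schedule 150000
";

/// Returns (time in stacks containing `frame`, total time), both in
/// microseconds. Malformed lines are skipped.
fn blocked_time_through(folded: &str, frame: &str) -> (u64, u64) {
    let mut matching: u64 = 0;
    let mut total: u64 = 0;
    for line in folded.lines() {
        // Split the trailing value from the stack.
        let (stack, value) = match line.trim().rsplit_once(' ') {
            Some(parts) => parts,
            None => continue,
        };
        let micros: u64 = match value.parse() {
            Ok(v) => v,
            Err(_) => continue,
        };
        total += micros;
        if stack.split(';').any(|f| f == frame) {
            matching += micros;
        }
    }
    (matching, total)
}

fn main() {
    let (matching, total) = blocked_time_through(SAMPLE, "rwsem_down_read_slowpath");
    println!("{} of {} us blocked behind a read-write semaphore", matching, total);
}
```

None of this waiting consumes CPU, which is why a stall of this kind can go unnoticed on utilization charts while dominating an off-CPU profile.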

3. Automated instrumentation on failure conditions

Embedding a trigger‑based profiler in production reduces the need for manual debugging sessions. The script runs with minimal overhead (≈ 0.2 % CPU) and only activates when the health check flags an anomaly. This pattern can be generalized: any metric that indicates a symptom (e.g., sudden latency spike, error burst) can launch a targeted eBPF trace, capturing the exact kernel state at the moment of failure.

4. Trade‑offs of increased resident memory

The fix added ~3 GB of RAM consumption per service instance. For LinkedIn’s fleet, this translated to an additional ~150 TB of memory across all pods. The engineering team evaluated the cost against the SLA impact of the freezes and concluded that the memory overhead was acceptable. Organizations must weigh similar trade‑offs, especially in environments where memory is a scarce resource.


Takeaways for infrastructure teams

  • Audit large containers: Identify any data structures that can trigger massive allocations; benchmark their growth patterns under realistic loads.
  • Deploy eBPF probes proactively: Use BCC or libbpf to embed low‑overhead tracers that can be toggled on demand.
  • Correlate memory spikes with kernel locks: Tools like perf lock or bpftrace can surface contention on process-wide semaphores such as mmap_lock.
  • Plan for memory budgeting: Pre‑allocation may increase baseline memory usage; ensure capacity planning accounts for worst‑case footprints.

The LinkedIn incident underscores that even well‑instrumented services can suffer from “invisible” kernel stalls. By marrying health‑check‑driven automation with eBPF off‑CPU profiling, the team turned a fleeting, hard‑to‑detect freeze into a concrete, fixable problem.


Author: Sergio De Simone – senior software engineer with extensive experience in systems programming and observability tooling.