Polar Signals Launches Off-CPU Profiling in Parca Agent: Unmasking Hidden Latency Bottlenecks
For performance engineers optimizing latency-sensitive systems, understanding what happens when processes aren't running on CPUs has long been a blind spot. Polar Signals tackles this gap with Off-CPU profiling for its eBPF-based Parca Agent, finally letting developers quantify time spent waiting for I/O, network responses, and other non-CPU bottlenecks.
Why Off-CPU Matters
While On-CPU profiling reveals compute-bound inefficiencies, it ignores critical latency sources like:
- Disk I/O contention
- Network call delays
- Lock synchronization waits
- Scheduling pauses
"When optimizing latency, it's crucial to understand why our process isn't performing work on the CPU," notes Polar Signals' engineering team. Without this data, engineers miss systemic slowdowns invisible to traditional profilers.
Under the Hood: Kernel Tracing & Sampling
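Implementing Off-CPU profiling required three pieces of kernel instrumentation:
1. Tracepoint Hooks: Leveraging Linux's sched:sched_switch tracepoint to detect when tasks leave the CPU
2. Kprobe Tracking: Attaching to finish_task_switch.isra.0 to measure off-CPU duration once a task is scheduled back in (the .isra.0 suffix is compiler-generated and can differ between kernel builds)
3. Sampling Throttle: The --off-cpu-threshold flag limits overhead by sampling only a fraction of scheduling events (e.g., 50 out of every 1,000) rather than recording all of them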
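The three steps above can be modeled in user space to make the mechanics concrete. The following is a minimal sketch, not the agent's eBPF code: the `OffCPUTracker` type and its methods are hypothetical names, and the sampling rule (keep an event when a random draw in [0, 1000) falls below the threshold) is an assumption modeled on the 50/1000 example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// OffCPUTracker is a hypothetical user-space model of the kernel-side logic.
type OffCPUTracker struct {
	threshold  int            // sample when a draw in [0, 1000) is below this value (assumed semantics)
	draw       func() int     // random source, injectable so tests can be deterministic
	switchedAt map[int]uint64 // thread ID -> timestamp (ns) at which the task left the CPU
}

// NewOffCPUTracker builds a tracker with the given threshold and random source.
func NewOffCPUTracker(threshold int, draw func() int) *OffCPUTracker {
	return &OffCPUTracker{
		threshold:  threshold,
		draw:       draw,
		switchedAt: make(map[int]uint64),
	}
}

// SwitchOut models the sched_switch hook: the task leaves the CPU.
// It returns true when the event was selected by the sampling throttle.
func (t *OffCPUTracker) SwitchOut(tid int, nowNs uint64) bool {
	if t.draw() >= t.threshold {
		return false // throttled: do not track this event
	}
	t.switchedAt[tid] = nowNs
	return true
}

// SwitchIn models the finish_task_switch hook: the task is back on a CPU.
// It returns the off-CPU duration in nanoseconds if the switch-out was sampled.
func (t *OffCPUTracker) SwitchIn(tid int, nowNs uint64) (uint64, bool) {
	start, ok := t.switchedAt[tid]
	if !ok {
		return 0, false
	}
	delete(t.switchedAt, tid)
	return nowNs - start, true
}

func main() {
	tracker := NewOffCPUTracker(50, func() int { return rand.Intn(1000) })
	sampled := 0
	for tid := 0; tid < 10000; tid++ {
		if tracker.SwitchOut(tid, 1000) {
			if _, ok := tracker.SwitchIn(tid, 4500); ok {
				sampled++
			}
		}
	}
	fmt.Printf("sampled %d of 10000 context switches\n", sampled)
}
```

With a threshold of 50, roughly 5% of context switches are tracked, which is the kind of reduction that keeps overhead manageable when switches occur thousands of times per second.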
"Kernel scheduling events can occur thousands of times per second. Without sampling, overhead becomes prohibitive," explains developer Florian Lehner, who spearheaded the data collection.
Cutting Through Runtime Noise
Initial deployments revealed a surprise: Runtime systems dominated off-CPU traces. In Go, runtime.usleep and garbage collection pauses appeared as top offenders, while Rust's Tokio runtime generated similar noise.
Polar Signals responded with a filtering toolkit:
- Stack Exclusion: "Not contains" filters remove known runtime patterns
- Multi-Filter Support: Combine exclusions (e.g., GC + timers)
- Runtime Presets: Preconfigured filters for Go and Tokio
// Example (illustrative): applying the Go runtime preset
offcpu.FilterPreset("go-runtime-expected")
After filtering, the true culprits emerged, such as network I/O stalls in Prometheus servers, where EpollWait and syscall.Write dominated latency.
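To illustrate how "not contains" exclusions and presets combine, here is a minimal sketch in Go. It is not Parca's implementation: the `Sample` type, the `FilterNotContains` function, and the contents of `goRuntimePreset` are assumptions made for illustration, and the sample stacks are made-up data.

```go
package main

import (
	"fmt"
	"strings"
)

// Sample is a hypothetical off-CPU sample: a stack (leaf frame first)
// and the time the thread spent blocked.
type Sample struct {
	Stack  []string
	WaitNs uint64
}

// containsFrame reports whether any frame in the stack contains pattern.
func containsFrame(stack []string, pattern string) bool {
	for _, frame := range stack {
		if strings.Contains(frame, pattern) {
			return true
		}
	}
	return false
}

// FilterNotContains keeps only samples whose stacks match none of the
// patterns, so several exclusions can be applied in one pass.
func FilterNotContains(samples []Sample, patterns ...string) []Sample {
	var kept []Sample
	for _, s := range samples {
		excluded := false
		for _, p := range patterns {
			if containsFrame(s.Stack, p) {
				excluded = true
				break
			}
		}
		if !excluded {
			kept = append(kept, s)
		}
	}
	return kept
}

// goRuntimePreset is an assumed example of what a runtime preset might
// contain; the real preset contents are not given in this article.
var goRuntimePreset = []string{"runtime.usleep", "runtime.gcBgMarkWorker"}

func main() {
	samples := []Sample{
		{Stack: []string{"runtime.usleep", "runtime.sysmon"}, WaitNs: 900},
		{Stack: []string{"syscall.Write", "net.(*conn).Write"}, WaitNs: 400},
		{Stack: []string{"runtime.gcBgMarkWorker"}, WaitNs: 700},
	}
	for _, s := range FilterNotContains(samples, goRuntimePreset...) {
		fmt.Println(s.Stack[0], s.WaitNs)
	}
}
```

Treating a preset as nothing more than a named list of exclusion patterns keeps presets and hand-written filters interchangeable.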
The New Optimization Workflow
1. Capture an On-CPU profile to optimize compute paths
2. Run Off-CPU analysis to identify I/O and scheduling bottlenecks
3. Apply runtime presets to eliminate noise
4. Triage remaining stacks (e.g., database calls, filesystem syncs)
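The triage step amounts to ranking what remains by total blocked time. A minimal sketch, assuming samples are available as stacks with a wait duration (the `Sample` and `FrameTotal` types and `RankByLeaf` are hypothetical names, and the data in `main` is invented):

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is a hypothetical off-CPU sample: a stack (leaf frame first)
// and the time the thread spent blocked.
type Sample struct {
	Stack  []string
	WaitNs uint64
}

// FrameTotal is the accumulated off-CPU time attributed to one leaf frame.
type FrameTotal struct {
	Frame  string
	WaitNs uint64
}

// RankByLeaf sums wait time per leaf frame and returns the totals sorted
// from most to least time blocked (ties broken by frame name).
func RankByLeaf(samples []Sample) []FrameTotal {
	totals := make(map[string]uint64)
	for _, s := range samples {
		if len(s.Stack) == 0 {
			continue // nothing to attribute the wait to
		}
		totals[s.Stack[0]] += s.WaitNs
	}
	ranked := make([]FrameTotal, 0, len(totals))
	for frame, ns := range totals {
		ranked = append(ranked, FrameTotal{Frame: frame, WaitNs: ns})
	}
	sort.Slice(ranked, func(i, j int) bool {
		if ranked[i].WaitNs != ranked[j].WaitNs {
			return ranked[i].WaitNs > ranked[j].WaitNs
		}
		return ranked[i].Frame < ranked[j].Frame
	})
	return ranked
}

func main() {
	samples := []Sample{
		{Stack: []string{"syscall.Write", "net.(*conn).Write"}, WaitNs: 300},
		{Stack: []string{"syscall.EpollWait"}, WaitNs: 500},
		{Stack: []string{"syscall.Write", "os.(*File).Write"}, WaitNs: 400},
	}
	for _, t := range RankByLeaf(samples) {
		fmt.Printf("%-20s %d ns\n", t.Frame, t.WaitNs)
	}
}
```

Grouping by leaf frame is only one choice; grouping by a caller further up the stack would instead show which application code paths lead to the waits.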
"We now see how allocation-heavy code triggers both CPU costs and scheduling penalties via GC," observes the team. This dual visibility is revolutionary for tuning cloud-native systems.
Future Extensions
Polar Signals invites community input to expand presets for additional runtimes (Java, Node.js, .NET). Early adopters can test the feature in Parca v0.22.0+ and share feedback via Discord.