
For performance engineers optimizing latency-sensitive systems, understanding what happens when processes aren't running on the CPU has long been a blind spot. Polar Signals tackles this gap with Off-CPU profiling in its eBPF-based Parca Agent, letting developers quantify time spent waiting on I/O, network responses, and other sources of off-CPU latency.

Why Off-CPU Matters

While On-CPU profiling reveals compute-bound inefficiencies, it ignores critical latency sources like:
- Disk I/O contention
- Network call delays
- Lock synchronization waits
- Scheduling pauses

"When optimizing latency, it's crucial to understand why our process isn't performing work on the CPU," notes Polar Signals' engineering team. Without this data, engineers miss systemic slowdowns invisible to traditional profilers.

Under the Hood: Kernel Tracing & Sampling

Implementing Off-CPU profiling required new kernel-level instrumentation:
1. Tracepoint Hooks: Attaching to Linux's sched:sched_switch tracepoint to detect when a task leaves the CPU
2. Kprobe Tracking: Probing finish_task_switch.isra.0 to measure how long the task stays off the CPU (both hooks are sketched below)
3. Sampling Throttle: The --off-cpu-threshold flag controls overhead by recording only a fraction of scheduling events (e.g., 50 out of every 1,000) so the agent doesn't flood the system
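
The blog doesn't include a code listing for this, but a minimal Go sketch using the cilium/ebpf library shows how these two kernel hooks might be attached. The object file name and program names here are assumptions for illustration, not Parca Agent's actual internals.

package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Load a pre-compiled BPF object containing the two handlers
	// (file and program names are hypothetical).
	coll, err := ebpf.LoadCollection("offcpu.bpf.o")
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	// Hook 1: the sched:sched_switch tracepoint fires when a task is switched off the CPU.
	tp, err := link.Tracepoint("sched", "sched_switch", coll.Programs["handle_sched_switch"], nil)
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Close()

	// Hook 2: a kprobe on finish_task_switch (finish_task_switch.isra.0 on many kernel
	// builds) fires when the next task resumes; the delta between the two events is
	// the off-CPU duration.
	kp, err := link.Kprobe("finish_task_switch", coll.Programs["handle_finish_task_switch"], nil)
	if err != nil {
		log.Fatal(err)
	}
	defer kp.Close()

	// Block so the links stay attached while the BPF programs collect samples.
	select {}
}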

"Kernel scheduling events can occur thousands of times per second. Without sampling, overhead becomes prohibitive," explains developer Florian Lehner, who spearheaded the data collection.
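
The sampling decision itself happens inside the eBPF program, but the idea behind --off-cpu-threshold is easy to sketch in Go: keep roughly N out of every 1,000 scheduler switches. The helper below is a conceptual illustration, not the agent's actual code.

package main

import (
	"fmt"
	"math/rand"
)

// shouldSample keeps roughly `threshold` out of every 1,000 scheduling events,
// mirroring the 50/1000 example above.
func shouldSample(threshold uint32) bool {
	return rand.Uint32()%1000 < threshold
}

func main() {
	const threshold = 50 // the 50-out-of-1,000 example from above
	const events = 1_000_000
	kept := 0
	for i := 0; i < events; i++ {
		if shouldSample(threshold) {
			kept++
		}
	}
	// Roughly 5% of events end up recorded.
	fmt.Printf("sampled %d of %d scheduling events\n", kept, events)
}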

Cutting Through Runtime Noise

Initial deployments revealed a surprise: Runtime systems dominated off-CPU traces. In Go, runtime.usleep and garbage collection pauses appeared as top offenders, while Rust's Tokio runtime generated similar noise.

Polar Signals responded with a filtering toolkit:
- Stack Exclusion: "Not contains" filters remove known runtime patterns
- Multi-Filter Support: Combine exclusions (e.g., GC + timers)
- Runtime Presets: Preconfigured filters for Go and Tokio

// Example (illustrative, not Parca's actual API): applying the Go runtime preset
offcpu.FilterPreset("go-runtime-expected")
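
Under the hood, a "not contains" filter amounts to dropping any stack whose frames match an excluded substring. A minimal Go sketch of that idea (not Parca's actual implementation):

package main

import (
	"fmt"
	"strings"
)

// excludeStacks drops every stack in which any frame contains one of the
// excluded substrings, mimicking combined "not contains" filters (e.g., GC + timers).
func excludeStacks(stacks [][]string, exclude []string) [][]string {
	var kept [][]string
	for _, stack := range stacks {
		excluded := false
		for _, frame := range stack {
			for _, pattern := range exclude {
				if strings.Contains(frame, pattern) {
					excluded = true
				}
			}
		}
		if !excluded {
			kept = append(kept, stack)
		}
	}
	return kept
}

func main() {
	stacks := [][]string{
		{"main.handleRequest", "net/http.(*Client).Do", "syscall.Write"},
		{"runtime.gcBgMarkWorker", "runtime.usleep"},
	}
	// Only the network I/O stack survives the runtime filters.
	fmt.Println(excludeStacks(stacks, []string{"runtime.gcBgMarkWorker", "runtime.usleep"}))
}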

After filtering, the true culprits emerged, such as network I/O stalls in Prometheus servers where EpollWait and syscall.Write dominated latency.

The New Optimization Workflow

1. Capture an On-CPU profile to optimize compute paths
2. Run Off-CPU analysis to identify I/O and scheduling bottlenecks
3. Apply runtime presets to eliminate noise
4. Triage the remaining stacks (e.g., database calls, filesystem syncs)

"We now see how allocation-heavy code triggers both CPU costs and scheduling penalties via GC," observes the team. This dual visibility is revolutionary for tuning cloud-native systems.

Future Extensions

Polar Signals invites community input to expand presets for additional runtimes (Java, Node.js, .NET). Early adopters can test the feature in Parca v0.22.0+ and share feedback via Discord.

Source: Polar Signals Blog: Introducing Off-CPU Profiling