In a fleet of seemingly identical servers, engineers at Vinted Engineering encountered a baffling inconsistency: some Redis instances suffered from inexplicable latency surges and ballooning CPU usage, while others hummed along efficiently. After meticulous investigation, the culprit emerged not from application code or network glitches, but from a fundamental Linux kernel setting—the clocksource. This discovery underscores how a subtle system misconfiguration can cripple performance, especially for latency-sensitive workloads like Redis.

The Stealthy Performance Killer: Clocksource Explained

At its core, a clocksource is the mechanism the Linux kernel uses to track time, crucial for everything from scheduling to I/O operations. Modern systems typically default to the Time Stamp Counter (TSC), a high-speed counter embedded directly in the CPU. However, when synchronization issues arise, such as after prolonged server downtime, the kernel may fall back to the High Precision Event Timer (HPET), a more stable but far slower alternative. As one kernel developer notes:

"TSC is orders of magnitude faster for frequent reads, but if the system detects inconsistencies across CPU cores, it defensively switches to HPET, trading speed for stability."

This switch, often invisible in routine monitoring, can have outsized effects. In Vinted's case, servers defaulting to HPET showed up to 30% higher CPU usage and doubled latency for Redis operations compared to those using TSC.
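
The cost gap is easy to observe directly: monotonic clock reads go through clock_gettime(), which the kernel services from the active clocksource. As a rough illustration (assuming python3 is available; any tight loop of clock reads works), time a single clock read under each clocksource and compare:

# Per-call cost of a monotonic clock read; re-run after switching clocksources
# (typically tens of nanoseconds on tsc, several hundred or more on hpet)
python3 -m timeit -s 'import time' 'time.monotonic_ns()'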

Unmasking the Issue Through Rigorous Testing

The investigation began with pattern recognition: slower servers consistently logged HPET as the active clocksource, with kernel messages hinting at TSC instability:

Apr 15 18:22:57 srv kernel: TSC synchronization [CPU#0 -> CPU#8]:
Apr 15 18:22:57 srv kernel: Measured 120 cycles TSC warp between CPUs, turning off TSC clock.
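
A quick way to spot this condition on a live host is to search the kernel log for clocksource messages (the exact wording varies by kernel version):

# Look for TSC instability and clocksource fallback messages
dmesg | grep -iE 'tsc|clocksource'
# or, on systemd hosts: journalctl -k | grep -iE 'tsc|clocksource'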

To quantify the impact, Vinted designed a controlled benchmark using Envoy proxies routing traffic to Redis. The test simulated real-world high-throughput scenarios:
- Setup: Two identical servers handled Redis commands via Go benchmark apps, scaling goroutines to increase load.
- Phases: Baseline (both on TSC), then alternating HPET on one server while the other remained on TSC.

Results were stark: servers on HPET immediately exhibited higher latency and CPU consumption under identical loads. Charts from the study (available in the source) visualized this—CPU usage spiked by over 25%, and Redis operations per second plummeted when HPET was active. This proved HPET's overhead isn't theoretical; it directly throttles throughput.
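
Vinted's harness used Go applications behind Envoy, but a rough approximation of the same comparison can be run with the stock redis-benchmark tool (the host, client count, and request volume below are illustrative):

# Drive sustained SET/GET load against the instance under test;
# run once per clocksource phase and compare throughput and latency
redis-benchmark -h 10.0.0.12 -p 6379 -c 50 -n 1000000 -t set,get -P 16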

Why This Happens and How to Fix It

The fallback to HPET often occurs after extended server shutdowns, where hardware quirks or firmware issues cause TSC desynchronization. Rebooting sometimes resolves it, but not reliably. For engineers, the fix is straightforward:

# Check available and current clocksource
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

# Force TSC if supported
echo "tsc" | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource

Vinted's timeline analysis showed servers booting with HPET after long downtimes, but manual intervention restored TSC and normalized performance. Proactively checking this setting during deployments or after maintenance can prevent costly degradation.
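
A lightweight guard wired into provisioning or monitoring makes the regression visible before it reaches production traffic. The script below is an illustrative sketch, not Vinted's tooling:

#!/bin/sh
# Fail loudly if the host has fallen back off TSC
cur=$(cat /sys/devices/system/clocksource/clocksource0/current_clocksource)
if [ "$cur" != "tsc" ]; then
    echo "WARNING: clocksource is '$cur', expected tsc" >&2
    exit 1
fi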

The Takeaway: Vigilance in System Tuning

For DevOps teams and developers, this case is a potent reminder that infrastructure performance hinges on low-level details. Sticking with TSC isn't just a tweak; it's essential for Redis, Kafka, or any workload that reads the clock at high frequency. As cloud and on-prem environments scale, automating clocksource checks could save countless cycles in troubleshooting. In the relentless pursuit of efficiency, sometimes the smallest knob turns the biggest wheels.

Source: Vinted Engineering