Chasing the Missing Millisecond: What One Engineer's Latency Hunt Reveals About Linux Gaming
#Backend

Chasing the Missing Millisecond: What One Engineer's Latency Hunt Reveals About Linux Gaming

Tech Essays Reporter
8 min read

A meticulous home experiment with a microcontroller and a light sensor traces the floaty feel of Linux gaming to a single overcautious compositor, and in doing so exposes how much performance hides inside the margins we build for safety.

There is a particular kind of discomfort that drives the best technical investigations, and it rarely shows up on a benchmark. It is the feeling that something is wrong before you can prove it. The developer behind farnoy.dev describes exactly this: after moving back to Linux from Windows, the mouse in games sometimes felt floaty, untethered, slightly behind the hand. No frame counter would flag it. No stutter graph would catch it. The system was running at a clean 120 frames per second on a 120 Hz panel, and yet the felt experience betrayed a deficit that the numbers refused to confirm. The resulting deep investigation into Linux latency and compositor tuning is one of the more rigorous pieces of amateur systems archaeology I have read in some time, and its real subject turns out to be larger than gaming.

Featured image

The central argument, once you follow the measurements to their conclusion, is this: the latency gap between Linux and Windows in this setup was not an architectural failing of the open stack. It was an accumulation of safety margins, rounding behaviors, and conservative defaults, each individually defensible, that together produced a perceptible lag. The story is less about Linux being slow and more about how engineered caution compounds into a tax that nobody specifically intended to impose.

Measuring what the eye can barely feel

The methodology deserves attention because it sets the epistemic standard for everything that follows. Rather than trust subjective impressions, the author built a click-to-photon rig from a Teensy microcontroller acting as a USB HID mouse, paired with a light sensor pressed against the screen and flashed with a modified open source LDAT sketch. The device clicks, then watches for the pixels to change, and logs the interval. Hundreds of samples per configuration, written unattended to CSV. Two machines, both with Ada-generation RTX cards and Zen 4 processors, both running nearly identical NixOS configurations alongside fresh Windows 11 installs, both feeding the same LG C1 at 120 Hz.

What makes the work credible is the honesty about its own messiness. The author catalogs the variables that kept sabotaging the runs: the LG panel silently toggling Black Frame Insertion when a different computer connected to the same HDMI port, a Konsole window triggering large wl_shm surfaces that crawled across PCIe and trained the compositor to expect the worst, V-Sync changes that refused to apply until a restart. Controlling for these is the unglamorous core of real measurement, and the writeup treats each discovered confound as a finding rather than an embarrassment. This is the difference between a benchmark and an investigation. A benchmark tells you a number. An investigation tells you why the number lied to you the first three times.

The accidental discovery that reframed everything

The pivotal moment arrives almost by accident. A synthetic test, just a black square that flashes white on click, showed the desktop running slower than the laptop despite near-identical hardware and software. That inversion made no sense. A fresh user account on the desktop closed most of the gap, which meant the culprit lived somewhere in the user profile rather than the hardware. By closing applications at random, the author isolated it: an open Zed editor window, sitting idle in the background, was adding at least 3 ms of latency to every other application on the screen.

An idle window should cost nothing. The fact that it cost milliseconds is the thread that, when pulled, unravels the whole garment. The reason Zed mattered is that it presents a new frame every single refresh interval through a Wayland frame callback, producing a constant 120 FPS and keeping the KWin compositor perpetually occupied. A background editor was behaving, from the compositor's perspective, like a demanding animation that never paused.

Where the latency actually lives

The game tests across Doom Eternal on Vulkan, Borderlands 3 on both DX11 and DX12, and Hades 2 on DX12 produce a consistent set of practical recommendations: prefer native Proton Wayland via PROTON_ENABLE_WAYLAND=1, cap the frame rate just below refresh to stop frames from queueing, set VKD3D_SWAPCHAIN_LATENCY_FRAMES=1 for DX12 titles, and reach for variable refresh rate only when a game cannot hold a stable target. DX12 emerges as consistently slower than DX11 across both operating systems, a finding that points more at engine and API design than at the translation layers people usually blame. Notably, the Nvidia-exclusive low-latency toggles on Windows moved the needle far less than the simple discipline of not letting frames pile up at the refresh ceiling.

The networking experiments are a worthwhile detour, demonstrating that USB/IP and Moonlight input relay both add latency cleanly proportional to network delay, with the practical implication that you really could exile your gaming hardware to the basement and route input over 2.5 GbE without paying a hidden penalty. But these results, useful as they are, function as supporting evidence for the larger claim that the compositor, not the wire, holds the interesting slack.

The anatomy of caution

The KWin section is where the piece earns its ambition. By instrumenting the compositor and capturing a frame mid-test, the author decomposes a roughly 9.78 ms input-to-present interval into its constituent decisions. KWin deliberately delays the start of compositing to give other clients a chance to report damaged surfaces, a genuinely reasonable design that lets multiple windows join a single frame. But it overestimated its own render time by about 2.39 ms, budgeted another 1.34 ms of safety margin, and the ideal lower bound, with all slack removed, sat around 3.07 ms.

Then comes the case that explains the floaty feeling. With a busy client occupying the compositor, a second window received its input 6.4 ms before the pageflip, more than enough time in principle, and still missed it, slipping a full frame to 14.74 ms total. The window of opportunity for the second client had shrunk under the weight of the first. This is the latency cliff, and it is not a bug in the conventional sense. It is pessimism encoded as policy.

The author then performs the genuinely valuable labor of locating each constant. A fixed 1 ms term added for scheduler inaccuracy, which turned out to be partly forced by Qt rounding every timer duration up to the next millisecond on Unix, addressed by rolling a custom timerfd-based timer with measured wakeup deviation of just 51 µs at p99. A safety margin defaulting to 1000 µs, which the latest Nvidia drivers tolerated being pushed into negative territory. A 2 ms floor on measured GPU compositing time, replaced with an adaptive estimate drawn from a ring buffer of the last 512 frames, against a measured real render time of 0.36 ms at the median. Each of these constants made sense when written. Each was a bet that hardware was less capable or less predictable than it actually is on a modern machine.

The deeper pattern

What the patched compositor demonstrates is that the floaty feeling had a real cause and a real fix. Minimum input-to-present latency dropped into the 3 ms range, fairness was restored so that a background animation no longer starved foreground windows, and the gains in windowed applications were substantial. The improvement against Windows was more modest in absolute terms, roughly 1.1 to 1.2 ms recovered against a best-case gap of around 4 ms, and the author is scrupulous about not overselling it. Fullscreen VRR games, which bypass the compositor's scanout path, see little benefit. This is not a victory lap. It is a careful accounting.

The broader lesson connects to something true across all systems software, not just compositors. Safety margins are written by engineers who cannot know the hardware their code will eventually run on, so they choose conservative constants that protect the worst case. A millisecond of slack is trivial at 60 Hz, where a frame lasts 16.7 ms. It becomes 36 percent of a frame at 360 Hz and 54 percent at 540 Hz, refresh rates that did not meaningfully exist when many of these defaults were chosen. The constants did not change. The hardware moved out from under them. The cost of caution is invisible until someone with a light sensor and patience makes it visible.

There is a tension here that the author navigates honestly, admitting to disliking the adaptive solution that replaced the hard floor even as it worked. Removing safety margins to claw back latency risks dropped or slipped frames, trading responsiveness for smoothness, and the stated plan to daily-drive the patched build while measuring slipped frames before submitting upstream is exactly the right discipline. Aggressive tuning that ships latency improvements at the cost of occasional stutter would simply relocate the discomfort rather than resolve it.

What this kind of work is for

The counter-perspective worth holding is that almost no user will ever feel 1.1 ms. The entire investigation chases differences at the threshold of human perception, and a reasonable person might ask whether the effort is proportionate. But that framing misses what the work actually produces. The value is not the millisecond. The value is the map. By tracing exactly where latency accumulates in the stack, from the kernel's DRM commit thread on SCHED_RR, through the render journal's adaptive estimator, up to Qt's timer rounding, the author has turned a vague community complaint into a set of specific, addressable mechanisms. The future work list, including FIFO_LATEST_READY_EXT to avoid queuing at the refresh ceiling and VK_EXT_present_timing for properly paced frame limiting, reads as a research agenda that others can now extend.

This is how open systems improve. Not through a single rewrite that declares the old approach obsolete, but through someone willing to instrument the code, name the constants, measure the slack, and submit the difference back. The compositor was never broken. It was cautious in ways that made sense once and stopped making sense quietly. Finding that out required treating a feeling as a hypothesis and building the instrument to test it, which is, in the end, the whole discipline compressed into one stubborn refusal to accept that the mouse only seemed slow.

Comments

Loading comments...