RADV Picks Up RDNA3+ Instruction Prefetching For Faster Shader Startup in Mesa 26.2

Valve's Georg Lehmann wired up the INST_PREF_SIZE register that RDNA3 has carried unused since 2022, letting RADV prefetch shader code past what the command processor DMA reaches. It targets shader startup latency on GFX11 and GFX12, and it lands in next quarter's Mesa 26.2.

Mesa's Radeon Vulkan driver just gained a small but interesting optimization that has been sitting dormant in AMD silicon for nearly four years. Georg Lehmann of Valve's Linux graphics team merged a patch teaching RADV to use INST_PREF_SIZE, a hardware feature AMD introduced with RDNA3 (GFX11) that controls how many bytes of shader instructions the GPU fetches before a wavefront actually starts running. The change covers RDNA3 and RDNA4 parts and is slated for the Mesa 26.2 release next quarter.


What INST_PREF_SIZE actually does

When a GPU launches a shader, the instruction bytes for that shader have to make their way into the instruction cache before the execution units can do anything useful. On a cold cache, the first few instructions of every shader invocation pay a latency penalty waiting on memory. AMD's command processor already does some prefetching through its DMA engine when it sets up a draw or dispatch, but that DMA path has limits on how far ahead it can pull instruction data.

INST_PREF_SIZE, added on GFX11, lets the driver tell the hardware how much shader code to prefetch on its own, reaching beyond what the command processor DMA can cover. As Lehmann put it in the merge request, the feature "was added on GFX11, to prefetch shaders even beyond what the [Command Processor] DMA can do. It should improve shader startup performance." The benefit is concentrated at the moment a shader first runs, so workloads that constantly swap between many distinct shaders, or that hit fresh pipelines mid-frame, stand to gain the most. Steady-state rendering, where the same handful of shaders stay hot in cache, will see little to no change, which is the normal shape of a startup-latency win.
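To make the idea of a driver-chosen prefetch size concrete, here is a minimal sketch of how a shader's code size might be turned into a value for a small hardware field. The function name, the 128-byte granule, and the 6-bit field width are illustrative assumptions, not values taken from the merge request or from RADV's code.

```python
def prefetch_size_field(code_size_bytes: int,
                        granule_bytes: int = 128,
                        field_bits: int = 6) -> int:
    """Convert a shader's code size into a prefetch-size field value.

    granule_bytes and field_bits are hypothetical defaults chosen for
    illustration; the real hardware values are not stated in the article.
    """
    if code_size_bytes < 0:
        raise ValueError("code size must be non-negative")
    max_field = (1 << field_bits) - 1
    # Round up so the last partial granule of the shader is still covered.
    granules = (code_size_bytes + granule_bytes - 1) // granule_bytes
    # Clamp to the field's range: a shader larger than the window
    # simply gets prefetched only up to the maximum.
    return min(granules, max_field)


# A 1000-byte shader needs 8 granules of 128 bytes under these assumptions.
small = prefetch_size_field(1000)
# A very large shader saturates the hypothetical 6-bit field.
large = prefetch_size_field(1_000_000)
```

The clamp is the design point worth noting: a fixed-width field means large shaders can only be partially prefetched, which fits the article's framing of this as a startup optimization rather than a way to keep whole shaders resident.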

Why it took until 2026

The obvious question for anyone who tracks the Mesa tree is why a GFX11 register is only being plumbed in now, well into the RDNA4 generation. The answer is the usual one for driver work: the easy cases are easy and the hard cases dominate the schedule. Lehmann was candid that getting it right for the geometry pipeline stages was the holdup. "Getting this to work for LSHS/NGG is a bit of a pain because of shader objects, vertex shader prologs and because the registers contain more than just the prefetch size," he noted.

That last point is the crux. The register holding the prefetch size is shared with other state, so the driver cannot just blindly write a prefetch value without disturbing neighboring fields. Combine that with NGG (the Next Generation Geometry path RADV uses for vertex and geometry processing), merged LS/HS tessellation stages, and the runtime composition of vertex shader prologs and shader objects, and the bookkeeping to set a correct prefetch size in every configuration gets fiddly fast. Features like this tend to wait until someone has both the time and the appetite to handle every permutation rather than shipping a version that only works on the simple pixel and compute cases.
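The shared-register problem described above is the classic read-modify-write pattern. The sketch below shows it on a 32-bit value; the field position and width (PREF_SIZE_SHIFT, PREF_SIZE_WIDTH) and the helper names are hypothetical and do not reflect the real register layout.

```python
# Hypothetical layout: a 6-bit prefetch-size field at bits 4..9 of a
# 32-bit register whose remaining bits hold unrelated state.
PREF_SIZE_SHIFT = 4
PREF_SIZE_WIDTH = 6
PREF_SIZE_MASK = ((1 << PREF_SIZE_WIDTH) - 1) << PREF_SIZE_SHIFT


def set_pref_size(reg_value: int, pref_size: int) -> int:
    """Return reg_value with only the prefetch-size field replaced."""
    if not 0 <= reg_value <= 0xFFFFFFFF:
        raise ValueError("register value must fit in 32 bits")
    if not 0 <= pref_size < (1 << PREF_SIZE_WIDTH):
        raise ValueError("prefetch size does not fit in the field")
    # Clear the old field, then OR in the new value; every other bit
    # of the register passes through untouched.
    return (reg_value & ~PREF_SIZE_MASK) | (pref_size << PREF_SIZE_SHIFT)


def get_pref_size(reg_value: int) -> int:
    """Extract the prefetch-size field from a register value."""
    return (reg_value & PREF_SIZE_MASK) >> PREF_SIZE_SHIFT


original = 0xDEADBEEF
updated = set_pref_size(original, 0x15)
```

The hard part in the driver is not the masking itself but knowing the final shader size at the point where the register is emitted, which is where prologs and shader objects complicate things.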


Review and provenance

The patch went through review from familiar names on the Radeon side. Samuel Pitoiset of Valve looked over the code, as did Marek Olšák, the longtime AMD driver developer who recently moved to Valve. Having two reviewers who live in the RADV and RadeonSI code paths sign off on a change that touches shared registers is reassuring, since a mistake in this area would manifest as subtle corruption or hangs in specific geometry configurations rather than an obvious failure.

Valve's continued investment here is consistent with its broader push on the Linux graphics stack that backs Steam and the Steam Deck. RADV is the driver doing the heavy lifting for Proton gaming on AMD hardware, and shader startup latency is exactly the kind of thing that shows up as hitching when a game compiles and first runs a pipeline mid-session. You can follow the driver work in the Mesa GitLab repository.

What to expect on real hardware

Lehmann did not publish numbers, so it is worth setting expectations. He described the change as something that "should" improve shader startup performance, without quantifying it in the merge request or the follow-up patch messages. For a prefetch tweak, that is normal: the gain depends heavily on shader size, cache pressure, and how often the workload introduces cold shaders. A title that streams in new effects and constantly touches fresh pipelines could see measurable smoothing of frame-time spikes, while a benchmark that runs a fixed render loop may show no change beyond run-to-run variance.

If you want to measure it yourself once it lands, the right approach is frame-time capture rather than average FPS. Tools like MangoHud for per-frame logging, paired with the existing RADV shader cache controls, will surface startup hitches far better than an averaged framerate counter that buries a 2ms spike across a hundred smooth frames. Comparing a cold first run of a level against a warmed-up second run is the kind of A/B that isolates startup latency from steady-state throughput.
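A short sketch shows why an average hides the spike described above. It assumes you already have per-frame times in milliseconds as a plain list; parsing a MangoHud log is not shown, and the function names and the 1 ms threshold are illustrative choices.

```python
from statistics import mean, median


def summarize(frame_times_ms):
    """Return the mean and worst frame time, plus the implied average FPS."""
    avg = mean(frame_times_ms)
    return {
        "mean_ms": avg,
        "max_ms": max(frame_times_ms),
        "avg_fps": 1000.0 / avg,
    }


def find_spikes(frame_times_ms, threshold_ms):
    """List (frame_index, frame_time) pairs that exceed the median
    frame time by more than threshold_ms."""
    typical = median(frame_times_ms)
    return [(i, t) for i, t in enumerate(frame_times_ms)
            if t - typical > threshold_ms]


# One hundred smooth 16 ms frames with a single 18 ms frame in the middle.
frames = [16.0] * 100
frames.insert(50, 18.0)

stats = summarize(frames)                       # mean barely moves off 16 ms
spikes = find_spikes(frames, threshold_ms=1.0)  # but the spike is flagged
```

In this synthetic run the mean frame time moves by about 0.02 ms, which is invisible in an FPS counter, while the per-frame view flags frame 50 directly. Running the same comparison on a cold and a warm pass of the same level is one way to do the A/B described above.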

For RDNA3 and RDNA4 owners running Mesa, this is a no-action upgrade: it arrives automatically with Mesa 26.2 and applies to GFX11 and GFX12 hardware with no configuration. Older GCN and RDNA1/RDNA2 parts lack the INST_PREF_SIZE register entirely and are unaffected. It is a modest, targeted optimization rather than a sweeping performance change, but filling in a hardware capability that has sat unused since the first RDNA3 cards shipped is exactly the kind of incremental polish that keeps the open-source AMD stack closing the gap with what the silicon was designed to do.