Accelerating wl_shm Buffer Uploads in KWin with udmabuf and Vulkan

The article explains how CPU‑rendered Wayland applications that use wl_shm buffers suffer from costly double‑copy uploads to the GPU. By wrapping shm memory in a dmabuf via the udmabuf driver and importing it directly into Vulkan, the compositor can eliminate the copies, dramatically reducing CPU usage and improving UI smoothness. The technique will land in Plasma 6.7 and Qt 6.11.2 and is recommended for any toolkit that still relies on shm buffers.

Thesis

While most modern Wayland clients render with the GPU, a substantial number of applications – especially those built on QtWidgets – still rely on CPU rendering and present their frames through wl_shm shared‑memory buffers. The traditional compositor path copies these buffers twice before they reach GPU memory, a process that becomes a noticeable bottleneck on high‑resolution displays. By leveraging the udmabuf Linux driver together with Vulkan extensions, the compositor can import the shared memory directly as a GPU‑accessible resource, eliminating the copies and slashing CPU load.

Key Arguments

The cost of wl_shm uploads
- A wl_shm buffer lives in ordinary system RAM. To display it, the compositor must first copy the data into a GPU‑compatible buffer (CPU‑side copy) and then copy that buffer into GPU memory (GPU‑side copy). The first copy blocks the compositor’s main thread, while the second still traverses system memory on integrated GPUs, inflating both latency and CPU utilization.
- Real‑world observation on a Ryzen 7840U laptop showed cursor jitter and 80‑90 % single‑core CPU usage when scrolling in KDevelop, a clear symptom of the upload path being a performance choke point.
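To give a feel for the volume of data involved, the following is a rough, illustrative calculation rather than a measurement from the article. The function name and parameters are hypothetical, and it assumes the worst case in which the whole buffer is re-uploaded on every frame.

```cpp
#include <cassert>
#include <cstdint>

// Bytes copied per second when each frame of a buffer is copied
// `copiesPerFrame` times. Assumption: the full buffer is uploaded every
// frame (no damage tracking), which is a worst case used for illustration.
std::uint64_t copiedBytesPerSecond(std::uint32_t width,
                                   std::uint32_t height,
                                   std::uint32_t bytesPerPixel,
                                   std::uint32_t framesPerSecond,
                                   std::uint32_t copiesPerFrame)
{
    assert(bytesPerPixel > 0);
    // Widen before multiplying so large buffers cannot overflow 32 bits.
    const std::uint64_t frameBytes = std::uint64_t{width} * height * bytesPerPixel;
    return frameBytes * framesPerSecond * copiesPerFrame;
}
```

Under these assumptions, a 3840x2160 buffer with 4 bytes per pixel at 60 frames per second and two copies per frame works out to 3,981,312,000 bytes copied per second. Removing one copy halves that figure.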
Why Vulkan alone does not solve the problem
- The extension VK_EXT_external_memory_host lets Vulkan import a host pointer as device memory that can back a VkBuffer or VkImage, enabling asynchronous GPU copies. However, on AMD drivers this path is blocked for memory that originates from a file descriptor, which is exactly how wl_shm buffers are allocated, due to security constraints.
- VK_EXT_host_image_copy can drop the second copy but leaves the first copy untouched, so the primary bottleneck remains.
udmabuf as the missing bridge
- udmabuf is a kernel driver that creates a dmabuf handle from a memfd (the typical allocation method for wl_shm). A dmabuf is a cross‑process, GPU‑friendly buffer descriptor.
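- The main layout requirement is that the underlying memory be page‑aligned in both offset and size (the kernel additionally expects the memfd to be sealed against shrinking with F_SEAL_SHRINK). Since most toolkits allocate one memfd per shm buffer, rounding the allocation size up to the page boundary incurs virtually no extra cost.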
- When the stride (row size in bytes) aligns to a driver‑friendly multiple (commonly 256 bytes), the dmabuf can be imported directly as a VkImage without any copy at all.
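The memfd-to-dmabuf step described above can be sketched as follows. This is a minimal illustration using the Linux udmabuf ioctl interface, not KWin's actual code. The function names and the "shm-buffer" label are hypothetical, and the sketch assumes that /dev/udmabuf exists and is accessible to the calling process; if it is not, the wrapper simply returns -1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <linux/udmabuf.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

// Round `size` up to a multiple of the system page size.
std::size_t roundUpToPage(std::size_t size)
{
    const long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    const std::size_t p = static_cast<std::size_t>(page);
    return (size + p - 1) / p * p;
}

// Hypothetical helper: create a memfd whose size is page-aligned and which
// is sealed against shrinking, as udmabuf expects. Returns the fd or -1.
int createPageAlignedMemfd(std::size_t size)
{
    assert(size > 0);
    const int fd = memfd_create("shm-buffer", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, static_cast<off_t>(roundUpToPage(size))) < 0
        || fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Wrap the page-rounded first `size` bytes of `memfd` in a dmabuf.
// Returns a dmabuf file descriptor, or -1 if the device is missing or the
// kernel rejects the request (for example because the memfd is smaller than
// the rounded size or is not sealed with F_SEAL_SHRINK).
int udmabufFromMemfd(int memfd, std::size_t size)
{
    const int dev = open("/dev/udmabuf", O_RDWR | O_CLOEXEC);
    if (dev < 0)
        return -1;
    udmabuf_create request{};
    request.memfd = static_cast<std::uint32_t>(memfd);
    request.flags = UDMABUF_FLAGS_CLOEXEC;
    request.offset = 0;                  // must be page-aligned
    request.size = roundUpToPage(size);  // must be page-aligned
    const int dmabuf = ioctl(dev, UDMABUF_CREATE, &request);
    close(dev);
    return dmabuf < 0 ? -1 : dmabuf;
}
```

A caller would typically try udmabufFromMemfd first and keep its existing upload path for the -1 case, since udmabuf availability depends on kernel configuration and device permissions.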
Implementation details
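- In KWin, the compositor now attempts to create a udmabuf for each incoming wl_shm buffer. If successful, it imports the dmabuf into Vulkan by passing the file descriptor in a VkImportMemoryFdInfoKHR structure to vkAllocateMemory (VK_KHR_external_memory_fd together with VK_EXT_external_memory_dma_buf) and binds the resulting memory to a VkImage. If the import fails, the legacy double‑copy path is used as a fallback.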
- On the client side, only a few lines of Qt code needed to be adjusted to allocate the shm buffer with page‑aligned size and appropriate stride padding. The change is minimal – 18 lines in total – yet yields a massive performance gain.
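The alignment rules that the client and the compositor have to agree on can be sketched as below. This is an illustration under stated assumptions, not the actual Qt or KWin change: the names are hypothetical, 256 bytes is the "commonly" quoted stride multiple from the text rather than a guaranteed driver value, and 4096 stands in for the page size, which real code should query at run time.

```cpp
#include <cassert>
#include <cstddef>

// Assumed values for illustration only (see the note above).
constexpr std::size_t kStrideAlignment = 256;
constexpr std::size_t kPageSize = 4096;

// Round `value` up to the next multiple of `alignment`.
std::size_t alignUp(std::size_t value, std::size_t alignment)
{
    assert(alignment > 0);
    return (value + alignment - 1) / alignment * alignment;
}

struct ShmLayout {
    std::size_t stride; // bytes per row, including padding
    std::size_t size;   // total allocation size in bytes
};

// Client side: pad each row to the stride alignment, then round the whole
// allocation up to a page boundary.
ShmLayout paddedLayout(std::size_t width, std::size_t height,
                       std::size_t bytesPerPixel)
{
    const std::size_t stride = alignUp(width * bytesPerPixel, kStrideAlignment);
    return ShmLayout{stride, alignUp(stride * height, kPageSize)};
}

// Compositor side: decide whether a buffer is a candidate for the direct
// import path. Anything that fails this check is sent to the legacy copy
// path in this sketch.
bool qualifiesForDirectImport(std::size_t bufferOffset, std::size_t bufferSize,
                              std::size_t stride)
{
    return bufferOffset % kPageSize == 0
        && bufferSize % kPageSize == 0
        && stride % kStrideAlignment == 0;
}
```

For example, a 1000-pixel-wide buffer with 4 bytes per pixel has 4000-byte rows, which paddedLayout widens to 4096 bytes. An unpadded 1000x600 buffer (stride 4000, size 2,400,000 bytes) fails the check on both stride and size.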
Quantitative results
- After the patch, the same KDevelop scrolling test shows CPU usage dropping from ~85 % to ~20 % on a single core.
- Cursor movement becomes perfectly smooth even under the "power‑save" profile, confirming that the compositor’s main thread is no longer blocked by texture uploads.

Implications

For KDE Plasma – The optimisation will be part of Plasma 6.7 and Qt 6.11.2, meaning that the majority of KDE applications that still rely on QtWidgets will immediately benefit.
For other toolkits – GTK, SDL, or custom frameworks that use wl_shm can adopt the same page‑aligned memfd + udmabuf strategy, gaining comparable reductions in CPU load without rewriting their rendering pipelines.
Energy efficiency – Lower CPU usage translates directly into reduced power draw, a crucial factor for laptops and ARM‑based devices where integrated GPUs share system memory.
Future compositor designs – The success of this approach suggests that compositors should treat wl_shm as a first‑class GPU resource rather than a legacy fallback, encouraging broader adoption of dmabuf‑based zero‑copy paths.

Counter‑Perspectives

Driver support variability – The current implementation relies on AMD’s driver accepting dmabuf imports from memfd. While Intel and Nvidia drivers also expose dmabuf import APIs, subtle differences in security policies could require per‑vendor work‑arounds.
Memory overhead – Aligning allocations to page size and padding rows to a 256‑byte stride introduces a modest memory overhead (estimated at ≈1‑2 % for 4K images; the hard bound is 255 bytes per row plus less than one page per buffer, so narrow buffers pay proportionally more). For systems with many concurrent shm buffers this could become noticeable, though still far outweighed by the CPU savings.
Complexity vs. benefit for low‑resolution devices – On low‑DPI screens the double‑copy cost is minimal; developers may decide the added code path isn’t justified for such environments.

Conclusion

By converting wl_shm buffers into dmabuf objects with udmabuf, and then importing them directly into Vulkan, KWin eliminates the expensive CPU‑side copy that has long plagued CPU‑rendered Wayland clients and, when the stride is suitably aligned, the GPU‑side copy as well. The result is a dramatic drop in compositor CPU usage and a smoother user experience, especially on high‑resolution displays. The approach is lightweight, requires only minor changes in the client toolkit, and stands as a compelling example of how thoughtful kernel‑level plumbing can unlock performance gains without overhauling existing rendering code.

Further reading