Neptune: Direct3D Virtualization for QEMU – A Technical Chronicle and AI‑Powered Case Study | LavX News

Neptune extends virglrenderer to transport Direct3D 11 calls over the Virtio‑GPU device, allowing Windows games to run inside QEMU guests with performance that matches or exceeds the existing Vulkan‑based Venus backend. The project demonstrates how a seasoned systems engineer can harness Claude Code as an ultra‑fast, literal junior developer, turning an apparently intractable problem into a working prototype through disciplined goal‑setting, bounded search, and rigorous verification.

Thesis

Neptune proves that Direct3D virtualization is feasible inside QEMU when the design is anchored to a proven reference (the Venus Vulkan backend) and when an AI coding assistant is treated as a highly capable but narrowly guided collaborator. The result is a functional Direct3D 11 stack that, on a modest Intel NUC, delivers equal or higher frame rates than the DXVK‑Venus combination while exposing a reproducible methodology for future extensions to Direct3D 12, macOS hosts, and Windows guests.

1. Why Direct3D virtualization mattered

QEMU’s default graphics path (virglrenderer + OpenGL) cannot run Windows games that rely on Direct3D 11.
Existing work‑arounds – stacking DXVK over Vulkan, then over Venus – add two translation layers, each with its own set of missing features (e.g., MoltenVK on macOS lacks several extensions required by DXVK).
A native‑to‑native path would eliminate the Vulkan‑to‑Metal indirection on macOS and free the guest from the CPU‑heavy state‑tracking performed by DXVK.

2. Core design – Neptune as a Venus clone

Component	Role	Implementation
Guest driver	Windows D3D11/DXGI calls (via Wine) are intercepted and serialized into a shared‑memory ring buffer.	`mesa/src/virtio/neptune/` (≈13 k LOC)
Protocol generator	Parses Microsoft IDL files, builds a JSON description, then emits C wire‑format code.	`neptune-protocol/` (≈10 k hand‑written + 341 k generated)
Host render server	Deserializes the ring, forwards calls to a fork of DXVK that runs on the host, and returns dmabuf handles.	`virglrenderer/src/neptune/` (≈9.5 k LOC)
DXVK fork	Provides the D3D11→Vulkan translation; modified to expose a headless WSI that exports dmabufs instead of presenting directly.	`dxvk/` (≈3 k LOC)
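
The protocol‑generator row can be illustrated with a toy version of its last stage: turning a JSON method description into C encoder source. This is a minimal sketch under stated assumptions, not Neptune's actual generator. The JSON schema, the `npt_encode_*` naming, the `struct npt_encoder` type, and the `npt_write_*` helpers are all hypothetical; only the `npt_` prefix and the IDL → JSON → C pipeline come from the article.

```python
import json

# Hypothetical JSON description of one interface method, as an IDL
# parsing stage might produce it. The schema is an assumption.
DESCRIPTION = json.loads("""
{
  "interface": "ID3D11DeviceContext",
  "method": "Draw",
  "opcode": 1,
  "params": [
    {"name": "VertexCount", "type": "uint32_t"},
    {"name": "StartVertexLocation", "type": "uint32_t"}
  ]
}
""")


def emit_encoder(desc):
    """Return C source text for a function that serializes one call."""
    name = f"npt_encode_{desc['interface']}_{desc['method']}"
    # The encoder handle always comes first, then the method's own parameters.
    args = ["struct npt_encoder *enc"]
    args += [f"{p['type']} {p['name']}" for p in desc["params"]]
    lines = [
        f"static inline void {name}({', '.join(args)})",
        "{",
        f"    npt_write_uint32_t(enc, {desc['opcode']}); /* opcode */",
    ]
    # One write call per parameter, in declaration order.
    for p in desc["params"]:
        lines.append(f"    npt_write_{p['type']}(enc, {p['name']});")
    lines.append("}")
    return "\n".join(lines)


c_source = emit_encoder(DESCRIPTION)
print(c_source)
```

Generating the encoders from a description rather than writing them by hand is what would make a ratio like 10 k hand‑written to 341 k generated lines plausible: each additional method costs one JSON entry instead of a hand‑maintained pair of encode and decode functions.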

The architecture mirrors Venus almost exactly; the only substitution is the API front‑end (Direct3D 11 instead of Vulkan). By reusing Venus’s ring‑buffer transport, command‑batching logic, and dmabuf handling, the project avoided reinventing the wheel and kept the codebase size manageable.
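
To make the ring‑buffer transport concrete, the sketch below models a single‑producer, single‑consumer byte ring in which the guest frames each call and the host decodes it. It is an illustration only: the `Ring` class, the `<opcode, length, payload>` framing, and the opcode value are assumptions, and a real transport would live in shared memory with atomically updated head and tail indices rather than in a Python object.

```python
import struct


class Ring:
    """Single-producer, single-consumer byte ring over a fixed buffer."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0  # total bytes consumed by the host side
        self.tail = 0  # total bytes produced by the guest side

    def _write(self, data):
        # Check free space first so a frame is written whole or not at all.
        if len(data) > self.size - (self.tail - self.head):
            raise BufferError("ring full; a real transport would wait for the host")
        for b in data:
            self.buf[self.tail % self.size] = b  # wrap around the buffer end
            self.tail += 1

    def _read(self, n):
        out = bytes(self.buf[(self.head + i) % self.size] for i in range(n))
        self.head += n
        return out

    def submit(self, opcode, payload):
        """Guest side: frame one command as <opcode, length, payload>."""
        self._write(struct.pack("<II", opcode, len(payload)) + payload)

    def drain(self):
        """Host side: decode every command currently in the ring."""
        commands = []
        while self.tail - self.head >= 8:
            opcode, length = struct.unpack("<II", self._read(8))
            commands.append((opcode, self._read(length)))
        return commands


ring = Ring(24)
ring.submit(1, struct.pack("<II", 3, 0))   # occupies bytes 0..15
first = ring.drain()
ring.submit(1, struct.pack("<II", 6, 3))   # wraps past the end of the buffer
second = ring.drain()
print(first, second)
```

The second submission straddles the end of the 24‑byte buffer, which is the case a ring transport has to get right: the indices keep counting upward and only the buffer offset wraps.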

3. Performance results

The benchmark suite ran on a 2018 Intel NUC (AMD Polaris GPU) with QEMU launched as `-accel kvm -cpu host -smp 4 -m 16G`. The table below shows raw scores; the shaded bars illustrate that Neptune matches or outperforms the DXVK + Venus baseline in every test.

3DMark Fire Strike – Neptune beats DXVK + Venus by 8 %
Unigine Heaven – Neptune gains 12 %
Final Fantasy XIV Dawntrail – Neptune leads by 5 %
Civilization VI – Neptune matches Venus within measurement error

[Figures: 3DMark Fire Strike on DXVK + Venus; 3DMark Fire Strike on Neptune; Unigine Heaven on DXVK + Venus; Unigine Heaven on Neptune]

The key insight, supplied by Claude, is that moving the heavy DXVK state‑tracking from the guest’s four vCPUs to the host’s many cores eliminates the dominant CPU bottleneck. The extra ring traffic is negligible compared with the saved DXVK work.

4. Development timeline – from idea to working prototype

Phase	Milestones	Typical effort
Exploratory attempts	VirtualBox SVGA, gfxstream, Gallium‑for‑Windows – all abandoned due to driver‑stack mismatches.	2 months (manual research)
Neptune bring‑up	Implement virtio transport, generate protocol, compile host/server, run a static‑triangle demo.	3 weeks (AI‑driven code generation)
Debug & performance	Multi‑ring stalls, seqno wrap, dmabuf WSI shutdown, frame‑order bugs – each resolved after a goal‑driven autonomous loop.	6 weeks (Claude + human oversight)
Benchmarking & polishing	Run 3DMark, Heaven, FFXIV, Civ VI; add variant analysis, clean comments, squash commits.	2 weeks
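
The "seqno wrap" entry in the table names a well‑known class of bug in fixed‑width sequence counters. The article does not describe the actual defect or its fix, so the following is only a sketch of the general technique, assuming a 32‑bit counter: compare sequence numbers by their wrapped difference rather than with a plain `>=`. The function names are hypothetical.

```python
SEQNO_BITS = 32                      # assumed counter width
MASK = (1 << SEQNO_BITS) - 1
HALF = 1 << (SEQNO_BITS - 1)


def seqno_passed(current, target):
    """True if `current` has reached or passed `target`, modulo 2**32.

    Valid as long as the two values are less than 2**31 apart.
    """
    return ((current - target) & MASK) < HALF


def naive_passed(current, target):
    """Plain comparison, which breaks when the counter wraps."""
    return current >= target


# The waiter wants seqno 0xFFFFFFF0; the counter has since wrapped to 3.
wrapped_ok = seqno_passed(0x00000003, 0xFFFFFFF0)
wrapped_naive = naive_passed(0x00000003, 0xFFFFFFF0)

# The waiter wants seqno 5 (issued after the wrap); the counter is still
# at 0xFFFFFFF0, so the work has not completed yet.
early_ok = seqno_passed(0xFFFFFFF0, 0x00000005)
early_naive = naive_passed(0xFFFFFFF0, 0x00000005)

print(wrapped_ok, wrapped_naive, early_ok, early_naive)
```

The naive comparison fails in both directions: in the first case it would wait forever, and in the second it would report completion too early.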

The timeline shows that the bulk of effort was spent on debugging and refactoring rather than on writing new algorithms. Claude excelled at large‑scale renames, bulk comment cleanup, and generating repetitive boilerplate code; the human intervened whenever a hypothesis needed deeper domain knowledge or when the model attempted an out‑of‑scope change.

5. How AI was used – patterns that made it work

Goal + bounds + verification – Every task began with a concrete success condition (e.g., “run 10 × 5 min Crash Bandicoot without hangs”) and a strict reference to Venus. This kept the search space tiny.
Two‑axis validation – After a fix, the model was required to (a) run the benchmark suite and (b) capture a frame‑by‑frame visual trace (via a custom xcap tool). Discrepancies in either axis forced a rollback.
Sub‑agent fan‑out – For breadth‑only jobs ("find every occurrence of npt_sizeof_T"), Claude spawned lightweight agents that scanned the repository in parallel, returning concise reports.
Memory files as institutional knowledge – Whenever a mistake was caught (e.g., using pgrep -f on its own process), the user wrote a MEMORYfeedback_*.md entry. Subsequent sessions consulted this file automatically, preventing recurrence.
Stop‑hook loops – The /goal and /stop-hook commands let the model continue iterating until a measurable metric (e.g., “recover ≥ 90 % of the 160 s gap”) was satisfied.
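
The goal, two‑axis validation, and stop‑hook patterns above combine into a simple control loop, sketched below. This is not how the `/goal` or `/stop-hook` commands are implemented; it is a hypothetical model in which `attempt_fix`, `verify`, and `rollback` are stand‑in callables and the iteration budget plays the role of the explicit bound.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    benchmark_ok: bool   # axis (a): the benchmark suite passed
    visual_ok: bool      # axis (b): the frame-by-frame trace matched

    @property
    def passed(self) -> bool:
        return self.benchmark_ok and self.visual_ok


def run_until_goal(attempt_fix: Callable[[int], None],
                   verify: Callable[[], Verdict],
                   rollback: Callable[[], None],
                   max_iterations: int) -> int:
    """Iterate until both validation axes pass or the budget is spent.

    Returns the 1-based iteration that succeeded, or 0 if none did.
    """
    for i in range(1, max_iterations + 1):
        attempt_fix(i)
        if verify().passed:
            return i      # explicit pass condition: stop immediately
        rollback()        # a failure on either axis discards the change
    return 0


# Stub environment: attempt 2 fixes the benchmark only, attempt 3 fixes both.
state = {"fix": 0, "rollbacks": 0}


def attempt_fix(i):
    state["fix"] = i


def verify():
    return Verdict(benchmark_ok=state["fix"] >= 2, visual_ok=state["fix"] >= 3)


def rollback():
    state["fix"] = 0
    state["rollbacks"] += 1


result = run_until_goal(attempt_fix, verify, rollback, max_iterations=5)
print(result, state["rollbacks"])
```

Returning as soon as the pass condition holds is also what guards against the over‑validation failure mode: the loop has no way to keep re‑running checks after success.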

These patterns turned Claude into a tireless, predictable worker; the human supplied the strategic direction, taste, and final sanity checks.

6. Recurring failure modes and mitigations

Failure mode	Example	Mitigation
Premature hypothesis	Early assumption that TLS rings caused a UAF, leading to a wasted fix.	Require a data‑first step: instrument, collect, then hypothesize.
Over‑validation	Running the same validation script for hours after the bug was fixed.	Use explicit stop‑hooks with a clear pass condition.
Unauthorized scope creep	Modifying `npt_protocol_defs.h` when the user only wanted a bug‑fix.	Enforce “bounds” in the prompt and record the scope in memory.
Narrative comments	Auto‑generated comments like “Intentionally no in‑process signal handler”.	Add a memory rule to prune non‑essential narrative comments.
Script divergence	Claude building its own QEMU command line instead of using `run_qemu.sh`.	Store the canonical launch script path in memory and reference it in every prompt.

7. Implications for future graphics‑virtualization work

Direct3D 12 – With the Neptune transport in place, adding a DX12‑to‑Vulkan front‑end (e.g., vkd3d-proton) is a straightforward extension.
macOS host – The same ring‑buffer transport can be paired with Apple’s D3DMetal framework, bypassing Vulkan/MoltenVK entirely.
Windows guest – Once a Windows‑side Gallium driver matures, the same backend can serve native Direct3D calls without Wine, yielding a fully native Windows‑in‑QEMU experience.
Cross‑platform parity testing – Because Neptune and Venus share the same ring protocol, side‑by‑side benchmarks become trivial, providing a reliable baseline for any new host‑OS integration.

8. Counter‑perspectives

Human oversight remains essential – Claude frequently suggested work‑arounds (finite timeouts, defensive NULL checks) that conflicted with the project’s “no hacks” policy. The human had to enforce architectural discipline.
Token cost vs. speed – The project consumed > 22 billion cached tokens, translating to an estimated $11 k in API cost. While the speedup (≈ 3‑5×) is evident, smaller teams without unlimited token access would need to budget carefully or accept slower iteration.
Technical debt risk – Large‑scale automated refactors can introduce subtle bugs that escape initial testing. Continuous code‑review, variant analysis, and memory‑file documentation are required to keep debt manageable.
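
The two figures in the token‑cost item imply an effective blended price, which is a useful number for anyone budgeting a similar project. The calculation below uses only the values quoted above; because the token count is a lower bound, the result is an upper bound on the average rate, and it says nothing about any provider's actual price list.

```python
tokens = 22e9        # lower bound from the text ("> 22 billion cached tokens")
cost_usd = 11_000    # estimate from the text ("$11 k")

# Average price per million tokens implied by the two figures.
usd_per_million = cost_usd / (tokens / 1e6)
print(f"effective rate: at most ${usd_per_million:.2f} per million tokens")
```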

9. Takeaways for future AI‑assisted systems projects

Anchor every new component to a stable reference implementation. Venus served as the “golden model” that constrained design choices.
Define measurable goals, explicit bounds, and verification steps before any code is written. This prevents the model from wandering.
Persist corrective knowledge in machine‑readable memory files. A simple “do not use pgrep -f again” entry saved hours of repeated debugging.
Leverage sub‑agents for breadth‑only tasks. Parallel scanning of a codebase is something a human cannot match.
Treat the AI as a literal junior engineer: fast, obedient, but lacking judgment. The human must supply the judgment, taste, and final sign‑off.

Neptune is now merged into the UTM repository and can be tried with the --enable-neptune flag in the build scripts. The full source, protocol generator, and benchmark scripts are available at the official UTM GitHub organization.
