Building an FPGA 3dfx Voodoo with Modern RTL Tools

How SpinalHDL and netlist-aware debugging made a complex fixed-function GPU reimplementation tractable for one person

This frame of Screamer 2 was rendered not by an original 3dfx card and not by an emulator, but by an FPGA reimplementation of the Voodoo 1 that I wrote in SpinalHDL. Available on GitHub. What surprised me was not just that it worked. It was that a design like this can now be described, simulated, and debugged by one person, provided the tools let you express the architecture directly and inspect execution at the right level of abstraction.

The Voodoo 1 is old, but it is not simple. It has no transform-and-lighting hardware and no programmable shaders, so all of its graphics behavior is fixed in silicon: gradients for Gouraud shading, texture sampling, mipmapping, bilinear and trilinear filtering, alpha clipping, clipping, depth testing, fog, and more. A modern GPU concentrates much of its complexity in flexible programmable units. The Voodoo concentrates it in a large number of hardwired rendering behaviors.

A Fixed-Function Chip That Is Harder Than It Looks At first glance, the Voodoo looks almost modest. It is a memory-mapped accelerator with one job: render triangles as quickly as possible. Unlike later accelerators, it does not do transform and lighting, which means the host CPU handles the heavier 3D math. That can make the hardware sound simpler than it really is.

Even a single triangle may involve interpolated colors, texture sampling, mip level selection, bilinear or trilinear filtering, alpha clipping, depth comparison, clipping, and fog. None of these operations are programmable in the modern sense. They are all baked into the silicon.

That is the central contrast. In modern GPUs, complexity often comes from flexibility. In the original Voodoo, complexity comes from how many rendering behaviors are directly encoded in fixed-function hardware.

Why Register Writes Cannot All Behave the Same Way That fixed-function style shows up clearly in the register interface. On the Voodoo, writing to triangleCmd or ftriangleCmd launches a triangle. The other registers in the register bank describe how that triangle should be rasterized: which gradients to use, how textures should be sampled, which tests should run, and so on.

The catch is that the Voodoo is deeply pipelined. Rendering a pixel involves a series of stages: stepping gradients, sampling textures, combining colors, comparing against the depth buffer, and more. Pipelining slices that work into stages so multiple pixels can be in flight at once. That is how the chip achieves throughput that software cannot match.

Figure 1: The hard part is deciding which register writes may apply immediately, which must move with in-flight work, and which must wait until the pipeline is empty.

But pipelining creates a problem for the register model. Imagine triangle A is still moving through the pipeline while the CPU starts configuring triangle B. If a rendering setting changes too early, late pixels from triangle A may see state intended for triangle B. The result is subtle corruption: part of a triangle rendered with the wrong texture mode, wrong blending mode, or wrong depth behavior.

There are only two ways around that. Either a register write waits until the pipeline has drained before taking effect, or the write travels forward in step with the in-flight work so each triangle sees the state that belongs to it. In other words, register writes on the Voodoo are not just configuration updates. They are part of the timing contract of the machine.

The Voodoo's Four Register Behaviors In my model, Voodoo registers fall into four categories:

Type Behavior FIFO Enqueued and applied in order FIFO + Stall Enqueued, but only applied once the pipeline has drained Direct Applied immediately Float Converted, then written to the fixed-point form of a register

The important point is that these categories are architectural, not just software-facing. A register type tells you whether a piece of state can change immediately, whether it must move with in-flight work, or whether it must wait for the machine to become quiescent.

Figure 2: Why the register categories exist at all: without them, new state can bleed into old work.

That distinction turns out to be a very natural thing to model directly in the HDL.

Encoding Register Semantics in SpinalHDL

Figure 3: Hardware (Mine, left) vs reference (86Box, right). The symptom looked like a framebuffer hazard: a few blended overlay pixels would be lost while most of the frame remained correct.

The real issue was not one catastrophically broken block. It was a stack of small hardware-accuracy mismatches that only became visible together. The first problem was precision. Float-triangle W was being quantized too early as it passed through the TMU path. The second was that perspective texcoord rounding and per-pixel LOD adjustment were slightly off near mip boundaries. The third was in blending: I was using the expanded destination color for blend-factor math, but real Voodoo behavior effectively wants the dither-subtracted destination color instead.

Each of those behaviors was almost right in isolation. Together, on exactly the right class of blended textured primitives, they produced visibly wrong pixels. That is why the bug felt random. Most of the frame was fine, and even the failing path was only wrong in a narrow corner of the state space.

The fix was to stop arguing from the first plausible theory and instead match the machine stage by stage. I preserved wider W, S, and T accumulators, corrected the perspective rounding and LOD math, and fed dither-subtracted destination color into the blend-factor computation. Once those details matched the reference behavior, the "memory-ordering bug" disappeared, because it had never been a memory-ordering bug at all.

A conventional waveform viewer can show every signal involved here, but it leaves most of the reconstruction to the engineer. A netlist-aware query tool moves some of that reconstruction into the tooling itself. On a design like the Voodoo, that difference is the gap between a plausible theory and an actual explanation.

What Modern RTL Tools Actually Changed The Voodoo 1 is not simple because it is old. It is difficult in a very specific way: its behavior is fixed in silicon, so much of the complexity lives in control paths, register semantics, and pipeline timing rather than in programmability.

What modern RTL tools changed for me was not the amount of complexity in the design. They changed how much of that complexity I had to hold in my head at once. SpinalHDL let me encode architectural intent directly in the source instead of scattering it across declarations, bus logic, and documentation. Conetrace let me inspect execution in terms closer to the structure of the design than a raw waveform usually allows.

That combination is what made an FPGA reimplementation of the Voodoo feel tractable for one person. The machine is still complicated. But with the right abstractions, more of that complexity becomes representable, queryable, and therefore manageable.

#FPGA #SpinalHDL #GPU #Fixed-Function #RTL

Building an FPGA 3dfx Voodoo with Modern RTL Tools

Comments