Fabio Guzman claims custom digital chip runs microGPT at more than 56,000 tokens per second

Fabio Guzman says he built a gate-level Transformer chip prototype that runs Karpathy’s microGPT without a CPU or GPU.

Fabio Guzman said Friday he built a 100% digital integrated circuit that runs a small Transformer model with a KV cache at more than 56,000 tokens per second on an FPGA prototype.

Guzman posted the claim on X, saying he designed the chip gate by gate and ran Andrej Karpathy’s microGPT character model on pure digital silicon logic. He said the prototype runs at 80 MHz and spells out names from the model output.

The post did not name a company, funding round, investors or commercial launch plan. Guzman framed the work as a custom chip experiment, so readers should treat the performance claim as an early technical demo until he publishes RTL, synthesis data, model details and benchmark conditions.

The interesting part sits in the design choice. Most AI inference runs on CPUs, GPUs or neural accelerators that execute instructions or tensor programs. Guzman says he mapped a full Transformer, including the KV cache, into fixed digital logic. That means the chip’s gates implement the data path directly instead of asking a programmable processor to schedule each operation.

That approach can cut overhead for a tiny model. A Transformer inference step repeats matrix multiplies, normalization, activation functions and attention operations. A fixed circuit can stream values through those operations with less control logic than a general processor needs. The trade-off comes from flexibility. A GPU can run many models. A gate-level design favors the model shape its designer chose.
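A minimal sketch of what a fixed multiply-accumulate data path computes, written in Python for readability. Guzman has not disclosed his number format, so the 8-bit fractional fixed-point format here (`FRAC_BITS`) and the helper names are assumptions for illustration, not his design.

```python
FRAC_BITS = 8  # hypothetical fixed-point format: 8 fractional bits (not from the post)


def to_fixed(x: float) -> int:
    """Quantize a float to a fixed-point integer."""
    return round(x * (1 << FRAC_BITS))


def to_float(x: int) -> float:
    """Convert a fixed-point integer back to a float."""
    return x / (1 << FRAC_BITS)


def matvec_fixed(weights, vector):
    """Integer matrix-vector product, one multiply-accumulate per weight.

    In hardware each row could map to its own chain of multipliers and
    adders; here the loops stand in for that wiring.
    """
    out = []
    for row in weights:
        acc = 0  # wide accumulator, so intermediate sums do not overflow
        for w, x in zip(row, vector):
            acc += w * x
        # Shift right to return to the input scale (floors negative values).
        out.append(acc >> FRAC_BITS)
    return out


W = [[to_fixed(0.5), to_fixed(-0.25)],
     [to_fixed(1.0), to_fixed(2.0)]]
x = [to_fixed(2.0), to_fixed(4.0)]
y = [to_float(v) for v in matvec_fixed(W, x)]
print(y)  # [0.0, 10.0]
```

The weights and the loop bounds are fixed before the computation starts, which is the property that lets a circuit drop the instruction fetch and scheduling a general processor would need.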

Karpathy’s microGPT family of educational projects uses small models to teach neural network mechanics. A character-level model that spells names gives chip designers a useful target because it keeps memory, vocabulary and arithmetic demands small enough for an FPGA. That makes the demo plausible as a proof of architecture rather than a direct threat to data center GPUs.

The KV cache matters because autoregressive Transformers reuse past keys and values during generation. Without a cache, the model would recompute attention history for each new token. With a cache, the circuit stores prior attention data and appends new entries as it generates text. Hardware designers care about that memory pattern because bandwidth and layout can dominate performance once the math pipeline grows.
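The caching pattern described above can be sketched in a few lines of Python. This is a generic single-head example with made-up two-element vectors; the class and function names are hypothetical and say nothing about how Guzman's circuit lays out its memory.

```python
import math


class KVCache:
    """Append-only store of past key and value vectors for one attention head."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, cache):
    """Scaled dot-product attention of one query over every cached position."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in cache.keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(cache.values[0])
    return [sum(w * v[i] for w, v in zip(weights, cache.values))
            for i in range(dim)]


def decode_step(query, key, value, cache):
    """Store the new token's key and value once, then attend over the history."""
    cache.append(key, value)
    return attend(query, cache)


cache = KVCache()
out1 = decode_step([1.0, 0.0], [1.0, 0.0], [2.0, 4.0], cache)
out2 = decode_step([0.0, 1.0], [0.0, 1.0], [6.0, 8.0], cache)
print(len(cache.keys), out1)  # 2 [2.0, 4.0]
```

Each step appends one entry and reads all earlier ones, so the read traffic grows with sequence length while the write traffic stays constant. That asymmetry is the memory pattern a hardware layout has to serve.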

Guzman’s 80 MHz figure also needs context. Modern GPUs run at much higher clock speeds and process large tensors with thousands of parallel lanes. A fixed-function FPGA design can still post a high token-per-second number on a small model because each token requires less work. A fair comparison would specify parameter count, context length, precision, batch size, power draw and output quality.
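The two numbers in the post are enough for a rough cycle budget. The sketch below assumes a single 80 MHz clock domain and one token generated at a time, neither of which the post confirms. Because the claim is "more than" 56,000 tokens per second, the results are upper bounds.

```python
CLOCK_HZ = 80_000_000        # 80 MHz, as stated in the post
TOKENS_PER_SECOND = 56_000   # "more than 56,000", so treated as a lower bound


def cycles_per_token(clock_hz: int, tokens_per_second: int) -> float:
    """Clock cycles available per generated token at a given throughput."""
    return clock_hz / tokens_per_second


budget = cycles_per_token(CLOCK_HZ, TOKENS_PER_SECOND)
latency_us = 1_000_000 / TOKENS_PER_SECOND

print(f"at most about {budget:.0f} cycles per token")
print(f"at most about {latency_us:.1f} microseconds per token")
```

Under those assumptions the design spends roughly 1,400 clock cycles, or about 18 microseconds, on each token. Whether that budget is impressive depends on the model size and precision that the post leaves out.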

Investors have funded custom AI chip companies for years, from cloud accelerators to edge inference startups. Those companies typically pitch three promises: lower latency, lower power and predictable cost for a defined workload. Guzman’s project fits the same market logic at prototype scale. If a team can compile narrow Transformer workloads into compact silicon, buyers in embedded systems, robotics and low-power devices may care.

The hard part starts after the demo. A useful product needs a toolchain that maps model graphs into hardware, tests numerical behavior and handles model updates. Teams also need enough memory near the compute blocks to avoid starving the circuit. A design that looks elegant for microGPT can hit routing, SRAM and verification limits when engineers scale it to a larger model.

Still, the post points at a real direction in AI hardware. Developers use small language models in places where a GPU costs too much, draws too much power or adds too much latency. A fixed digital circuit for a narrow model can serve that niche if it gives engineers a clear deployment path and stable accuracy.

Guzman has not shared enough public detail to support a full benchmark comparison. The claim deserves interest and skepticism in equal measure. The next useful artifacts would include source code, RTL or gate-level files, FPGA part details, precision format, model size, measured power and a reproducible test harness.

#FPGA #AI inference #custom chip #microGPT #Hardware_Design
