nCPU: A Neural Network-Based CPU That Runs Entirely on GPU
#Hardware

AI & ML Reporter

A research project that implements a complete CPU architecture where all arithmetic operations are performed by trained neural networks instead of traditional logic circuits, achieving 100% accuracy on integer operations while running entirely on GPU hardware.

The nCPU project represents a fascinating exploration into whether neural networks can replace traditional CPU arithmetic logic. Rather than simulating a CPU on GPU, this research runtime implements a complete 64-bit ARM64 CPU where every ALU operation—addition, multiplication, bitwise logic, even shifts—executes as a forward pass through trained PyTorch models.

How It Works

At its core, nCPU treats the entire CPU as a GPU-resident system. All state—registers, memory, flags, and program counter—lives permanently on GPU as PyTorch tensors. There's no round-tripping to the host CPU during execution: the instruction fetch, decode, execute, and writeback stages all happen on-device.
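As a minimal sketch of what GPU-resident state could look like (class and field names here are hypothetical, not the project's actual API):

```python
# Hypothetical sketch: all CPU state held as PyTorch tensors on one device,
# so the fetch/decode/execute loop never round-trips to the host.
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"

class NeuralCPUState:
    def __init__(self, mem_words: int = 65536):
        # 31 general-purpose 64-bit registers plus SP, as int64 tensors
        self.regs = torch.zeros(32, dtype=torch.int64, device=device)
        self.pc = torch.zeros(1, dtype=torch.int64, device=device)
        self.flags = torch.zeros(4, dtype=torch.bool, device=device)  # N, Z, C, V
        self.mem = torch.zeros(mem_words, dtype=torch.int64, device=device)

    def fetch(self) -> torch.Tensor:
        # Fetch stays a tensor; no .item()/.cpu() calls on the hot path
        return self.mem[self.pc]
```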

What makes this approach unique is that every arithmetic operation routes through a trained neural network model:

  • Addition/Subtraction: Uses a Kogge-Stone carry-lookahead network with 8 neural passes, achieving 100% accuracy
  • Multiplication: Employs a byte-pair lookup table with up to 64 pairs for 64-bit operations
  • Bitwise Operations: Vectorized neural truth tables handle AND, OR, XOR across all 32 bits simultaneously
  • Shifts: Attention-based bit routing per output position replaces traditional barrel shifters
  • Math Functions: Sine, cosine, square root, exponential, logarithm, and arctangent all use trained MLPs or residual networks
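The bitwise case reduces to a truth-table gather applied to every bit position at once. A hedged sketch with a hand-built table (the project uses trained models, but the vectorized structure is the same):

```python
# Vectorized truth-table lookup for bitwise ops: one gather covers all
# bit positions simultaneously instead of looping bit by bit.
import torch

# One row per op; columns indexed by 2*a + b for input bits (a, b)
TRUTH = torch.tensor([
    [0, 0, 0, 1],   # AND
    [0, 1, 1, 1],   # OR
    [0, 1, 1, 0],   # XOR
], dtype=torch.int64)

def bitwise(op: int, x: torch.Tensor, y: torch.Tensor, bits: int = 64) -> torch.Tensor:
    shifts = torch.arange(bits)
    a = (x >> shifts) & 1            # every bit position of x at once
    b = (y >> shifts) & 1
    out_bits = TRUTH[op, 2 * a + b]  # single vectorized table lookup
    return (out_bits << shifts).sum()
```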

Performance Characteristics

Benchmarked on Apple Silicon with PyTorch 2.10.0, the neural CPU exhibits a surprising performance profile. Multiplication runs in just 21 microseconds using the byte-pair LUT—faster than addition's 248 microseconds with carry-lookahead. This inverts the conventional CPU hierarchy, where MUL is typically slower than ADD.

Key findings from the benchmarks:

  • Multiplication is 12x faster than addition due to the LUT's zero sequential dependency versus carry-lookahead's O(log n) stages
  • Carry-lookahead works in neural networks: The Kogge-Stone algorithm reduced ADD/SUB/CMP from ~826us to ~248us (3.3x speedup)
  • Vectorization recovers attention costs: Shifts improved from ~2,833us to ~434us (6.5x speedup)
  • Three-tier operation hierarchy: O(1) single-pass lookups (21us), O(log n) parallel-prefix carry (248us), and O(n) sequential passes (sqrt ~522us, atan2 ~935us)
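The parallel-prefix structure behind the ADD speedup can be sketched with plain tensor ops. The project's version runs through trained networks; this hand-coded Kogge-Stone adder just shows where the O(log n) pass count comes from:

```python
# Kogge-Stone parallel-prefix addition: generate/propagate signals merge in
# log2(bits) doubling steps, each step a single vectorized tensor operation.
import torch

def kogge_stone_add(x: torch.Tensor, y: torch.Tensor, bits: int = 64) -> torch.Tensor:
    shifts = torch.arange(bits)
    a = (x >> shifts) & 1
    b = (y >> shifts) & 1
    g = a & b              # carry generate, per bit
    p = a ^ b              # carry propagate, per bit
    p0 = p.clone()         # keep the raw propagate bits for the final sum
    d = 1
    while d < bits:        # log2(bits) combine passes, not bits passes
        g_shift = torch.roll(g, d); g_shift[:d] = 0
        p_shift = torch.roll(p, d); p_shift[:d] = 0
        g = g | (p & g_shift)
        p = p & p_shift
        d *= 2
    carry_in = torch.roll(g, 1); carry_in[0] = 0   # carry into each bit
    return ((p0 ^ carry_in) << shifts).sum()
```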

Two Execution Modes

The project offers flexibility through two modes:

Neural Mode (default): Every ALU operation is a forward pass through a trained .pt model. This demonstrates the core research concept but runs at ~136-262 microseconds per cycle.

Fast Mode (--fast): Uses native PyTorch tensor operations (torch.add, torch.mul) instead of model inference. This targets 1.35M IPS at batch_size=32768 on Apple Silicon MPS, showing the architecture's potential when not constrained by neural inference.
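Fast mode's batched dispatch might look roughly like this (the opcode table and function names are assumptions, not the project's code); the point is one kernel launch per batch of lanes rather than one per instruction:

```python
# Hypothetical fast-mode dispatch: opcodes map straight to native PyTorch
# tensor ops applied across a large batch, amortizing kernel-launch overhead.
import torch

OPS = {0: torch.add, 1: torch.sub, 2: torch.mul}

def batched_alu(op: int, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x, y: shape (batch,) int64 tensors; one kernel serves every lane
    return OPS[op](x, y)

batch = 32768  # the batch size quoted for the 1.35M IPS figure
x = torch.arange(batch, dtype=torch.int64)
y = torch.ones(batch, dtype=torch.int64)
out = batched_alu(0, x, y)   # adds across all 32768 lanes in one launch
```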

Architecture Details

nCPU implements a complete 64-bit ARM64 instruction set. The text assembly interface supports standard operations like MOV, ADD, SUB, MUL, DIV, AND, OR, XOR, SHL, SHR, INC, DEC, CMP, and various conditional jumps. There's also a binary mode that decodes and executes real ARM64 instruction encodings.
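A text-assembly front end of this kind needs only a small tokenizer before dispatch; this is an illustrative sketch, not the project's actual parser:

```python
# Illustrative sketch of parsing one text-assembly line into a mnemonic
# and operand list, ready for dispatch to an ALU backend.
def parse(line: str) -> tuple[str, list[str]]:
    mnemonic, *operands = line.replace(",", " ").split()
    return mnemonic.upper(), operands
```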

The project includes 23 trained models totaling ~135MB, with 13 actively wired into the execution pipeline. These models were trained to achieve 100% accuracy on integer arithmetic, verified through 347 automated tests.

Practical Applications

Beyond the research value, nCPU includes practical demonstrations:

  • DOOM Raycaster Demo: A DDA raycaster that runs all arithmetic through trained neural networks, achieving ~2.5 FPS in neural mode versus ~5,000 FPS in fast mode
  • Assembly Program Support: Users can write and execute assembly programs with neural arithmetic
  • Inline Assembly: Direct execution of assembly instructions through the CLI
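The raycaster's inner loop is a standard DDA grid march, and its arithmetic (sin/cos, division, comparisons) is exactly what the demo routes through neural ops. A plain-Python sketch of that loop:

```python
# Standard DDA grid march: step cell-by-cell along a ray until a wall ('#')
# is hit, tracking distance to the next x- and y-gridline crossings.
import math

def cast_ray(px: float, py: float, angle: float, grid, max_dist: float = 64.0) -> float:
    dx, dy = math.cos(angle), math.sin(angle)
    mx, my = int(px), int(py)                       # current map cell
    ddx = abs(1.0 / dx) if dx else float("inf")     # ray length per x-cell
    ddy = abs(1.0 / dy) if dy else float("inf")     # ray length per y-cell
    step_x, side_x = (1, (mx + 1 - px) * ddx) if dx > 0 else (-1, (px - mx) * ddx)
    step_y, side_y = (1, (my + 1 - py) * ddy) if dy > 0 else (-1, (py - my) * ddy)
    dist = 0.0
    while dist < max_dist:
        if side_x < side_y:                         # next crossing is an x-gridline
            dist, side_x, mx = side_x, side_x + ddx, mx + step_x
        else:                                       # next crossing is a y-gridline
            dist, side_y, my = side_y, side_y + ddy, my + step_y
        if grid[my][mx] == "#":
            return dist
    return max_dist
```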

Why This Matters

The nCPU project challenges fundamental assumptions about CPU design. By demonstrating that neural networks can implement traditional CPU operations with perfect accuracy while offering unique performance characteristics, it opens questions about hybrid architectures where certain operations might benefit from learned implementations rather than hardcoded logic.

For ML engineers and computer architects, nCPU provides a concrete platform to explore these boundaries. The project's success with carry-lookahead neural networks and vectorized operations suggests that classical hardware design principles can indeed transfer to neural architectures—sometimes with surprising benefits.

The research shows that while neural CPUs aren't ready to replace traditional designs, they offer a valuable perspective on how machine learning might augment or transform fundamental computing operations in the future.

GitHub Repository | Documentation
