When Andrej Karpathy published "The Unreasonable Effectiveness of Recurrent Neural Networks" in 2015, it ignited mainstream developer fascination with AI. At a time when many perceived an "AI winter," Karpathy demonstrated how RNNs could ingest raw bytes of Shakespeare and generate startlingly coherent text—learning word formation, structure, and style from scratch.


alt="Article illustration 1"
loading="lazy">

Nearly a decade later, developer Giles Thomas revisited this watershed moment during a sabbatical, [reimplementing Karpathy's work in PyTorch](https://github.com/gilesgthomas/rnn-from-scratch) while drawing sharp contrasts with modern transformer-based LLMs. His exploration reveals why RNNs captivated engineers but ultimately hit fundamental limits that transformers would overcome.

## The Byte-Level Brilliance That Started It All

Karpathy's approach was radically minimalist: unlike today's token-based LLMs, his RNNs processed raw bytes (ASCII characters) using one-hot encoding. With no predefined vocabulary or token boundaries, the networks learned linguistic concepts purely from sequential patterns:
```python
# Simplified RNN input processing (Karpathy-style)
# `text` is the raw training corpus as a Python string
import torch

byte_vocab = sorted(set(text))                   # all unique bytes/characters in the corpus
input_tensor = torch.zeros(len(text), len(byte_vocab))
for i, char in enumerate(text):
    input_tensor[i, byte_vocab.index(char)] = 1  # one-hot encode each character
```

This byte-level processing made the results astonishing: networks progressed from gibberish to quasi-Shakespearean prose, even grasping play-script formatting. The technique spawned viral memes (like AI-generated "Friends" scripts) by demonstrating pattern recognition previously thought impossible.

## The Fixed-Length Bottleneck That Doomed RNNs

Despite being Turing complete in theory, RNNs faced a critical constraint: the hidden state. As Thomas explains:

> "RNNs process sequences step-by-step, compressing all prior context into a fixed-size hidden state vector. While mathematically capable of infinite memory, 32-bit floats physically limit context to ~10^38 distinct states—far less than meaningful language requires."

In practice, this bottleneck shows up during training as **vanishing gradients**. Backpropagation Through Time (BPTT) "unrolls" an RNN into an ultra-deep network (sequence_length × layers). For a 2-layer LSTM processing 1,000 characters, gradients must propagate through 2,000 virtual layers, causing the signal to decay or explode:
```
RNN Unrolled (Sequence Length=3, Layers=2):

Input1 → Layer1 → Layer2 → Output1
          ↓         ↓
Input2 → Layer1 → Layer2 → Output2
          ↓         ↓
Input3 → Layer1 → Layer2 → Output3
```
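
To make the fixed-size hidden state concrete, here is a minimal PyTorch sketch; the vocabulary size, hidden width, and sequence length are arbitrary illustration values, not taken from Karpathy's or Thomas's code. However long the input gets, everything the network remembers about it must fit into `hn`, whose shape never changes:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; not the dimensions used in the original posts.
vocab_size, hidden_size, num_layers = 96, 512, 2

rnn = nn.RNN(input_size=vocab_size, hidden_size=hidden_size, num_layers=num_layers)

x = torch.zeros(1000, 1, vocab_size)          # 1,000 one-hot bytes, batch size 1
h0 = torch.zeros(num_layers, 1, hidden_size)  # the fixed-size hidden state

output, hn = rnn(x, h0)
print(hn.shape)  # torch.Size([2, 1, 512]) -- the same shape whether we feed 10 or 10,000 bytes
```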

Karpathy mitigated the gradient problem with **truncated BPTT**, updating weights every 100 characters while carrying the hidden state forward between chunks, but the core context limitation remained.
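
Below is a minimal sketch of what a truncated-BPTT training loop looks like in PyTorch; it is not Thomas's or Karpathy's actual code, and the model, sizes, and synthetic byte stream are placeholders. The key move is detaching the hidden state between 100-character chunks so gradients stop at the chunk boundary while the state's values still carry forward:

```python
import torch
import torch.nn as nn

# Placeholder sizes and data; a real run would use the one-hot-encoded corpus.
vocab_size, hidden_size, chunk = 96, 128, 100

model = nn.LSTM(input_size=vocab_size, hidden_size=hidden_size, num_layers=2)
head = nn.Linear(hidden_size, vocab_size)      # predicts the next byte
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)

data = torch.randint(0, vocab_size, (10_000,))               # stand-in byte stream
onehot = torch.nn.functional.one_hot(data, vocab_size).float()

hidden = None
for start in range(0, len(data) - chunk - 1, chunk):
    x = onehot[start : start + chunk].unsqueeze(1)           # (chunk, batch=1, vocab_size)
    y = data[start + 1 : start + chunk + 1]                  # next-byte targets

    if hidden is not None:
        hidden = tuple(h.detach() for h in hidden)           # keep the state, cut the gradient graph
    out, hidden = model(x, hidden)
    loss = criterion(head(out.squeeze(1)), y)

    optimizer.zero_grad()
    loss.backward()                                          # BPTT only through the last `chunk` steps
    nn.utils.clip_grad_norm_(model.parameters(), 5.0)        # guard against exploding gradients
    optimizer.step()
```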

## Transformers: Trading Fixed Memory for Quadratic Complexity

Modern LLMs circumvent these RNN limitations through attention mechanisms and parallel processing:

| Aspect | RNNs/LSTMs | Transformer LLMs |
| --- | --- | --- |
| Context Handling | Fixed-size hidden state | Context vectors scale with input length |
| Training Complexity | O(sequence_length) time | O(sequence_length²) time |
| Parallelization | Sequential by design | Fully parallelizable |
| Tokenization | Byte/character level | Subword tokens (BPE/WordPiece) |

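
To ground the tokenization row above, the snippet below compares the same sentence as raw byte ids and as subword ids. It assumes the third-party `tiktoken` package purely for illustration; any BPE tokenizer would show the same effect.

```python
import tiktoken

text = "The unreasonable effectiveness of recurrent neural networks"

byte_ids = list(text.encode("utf-8"))                   # one id per byte, as Karpathy's RNNs saw it
bpe_ids = tiktoken.get_encoding("gpt2").encode(text)    # subword ids with built-in linguistic structure

print(len(byte_ids), len(bpe_ids))                      # the BPE sequence is much shorter
```
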
Transformers pay a computational price (quadratic attention scaling) but eliminate RNNs' context ceiling. As Thomas notes: "LLMs solve variable-length sequences by making state proportional to input length—a tradeoff worth billions in GPU investments."
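
As a rough sketch of where that quadratic cost comes from, here is single-head scaled dot-product attention in plain PyTorch, with random weights and arbitrary sizes chosen only for illustration. Every position scores every other position, so the attention matrix is `seq_len × seq_len`:

```python
import math
import torch

# Arbitrary illustrative sizes; real models use many heads and learned weights.
seq_len, d_model = 1_000, 64
x = torch.randn(seq_len, d_model)           # one vector per token

Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

scores = (q @ k.T) / math.sqrt(d_model)     # (seq_len, seq_len): the O(n^2) term
weights = torch.softmax(scores, dim=-1)
context = weights @ v                       # each position mixes information from all others

print(scores.shape)  # torch.Size([1000, 1000])
```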

## Why Karpathy's Insight Still Resonates

Remarkably, Karpathy foresaw attention's importance in his 2015 conclusion, calling it "the most interesting recent architectural innovation." His byte-level approach also highlights how far pretraining has come: modern tokenizers hand LLMs an initial linguistic structure that RNNs had to learn painstakingly from raw bytes. Thomas's [PyTorch implementation](https://github.com/gilesgthomas/rnn-from-scratch) offers a hands-on lens into this evolution. When trained on his own blog post, the RNN outputs:

"The RNNs in the sequence of the ware the for the extsting of the network and the frome to the the fard"


...a poetic reminder of how far we've come—and why foundational work remains essential reading for understanding AI's trajectory. As transformer context windows expand into millions of tokens, we're still grappling with variations of the same core challenge: how to remember what matters.

Source: Analysis based on Giles Thomas' Revisiting Karpathy's Unreasonable Effectiveness of RNNs