A comprehensive guide to microgpt, Andrej Karpathy's minimalist GPT implementation that distills the essence of large language models into a single, dependency-free Python file.
This article explores microgpt, Andrej Karpathy's remarkable achievement in simplifying the GPT architecture to its bare essentials. The project demonstrates that the core algorithmic content of a GPT model can be expressed in just 200 lines of pure Python with no external dependencies, while still capturing the fundamental mechanics of how large language models operate.
The Essence of Simplicity
At its core, microgpt is the product of a decade-long obsession with distilling large language models to their fundamental components. The implementation includes everything necessary: a dataset of documents, a tokenizer, an autograd engine, a GPT-2-like neural network architecture, the Adam optimizer, a training loop, and an inference loop. Everything else is merely an efficiency optimization.
The beauty of this approach lies in its pedagogical value. By stripping away all the complexity and optimizations that characterize production systems, microgpt reveals the pure algorithmic essence of what makes GPT models work. As Karpathy notes, this is the culmination of multiple projects including micrograd, makemore, and nanogpt, each contributing to this final distillation.
The Dataset: Learning from Names
The training data for microgpt consists of approximately 32,000 names, one per line, downloaded from a public repository. This simple dataset serves as an ideal playground for demonstrating the model's capabilities without requiring massive computational resources.
Each name in the dataset becomes a document that the model learns to complete. The goal is for the model to understand the statistical patterns within these names and generate new, plausible-sounding names that share similar characteristics. After training, the model produces outputs like "kamon," "ann," "karai," and "jaire" - names that, while not real, sound convincingly like they could be.
This approach mirrors how larger language models work. When you interact with ChatGPT, your conversation is essentially a document from the model's perspective. The model's response is simply a statistical completion of that document, just as microgpt completes names.
Tokenization: From Characters to Numbers
Neural networks operate on numbers, not text, so tokenization is the crucial first step in processing any language data. microgpt uses the simplest possible tokenizer: it assigns a unique integer ID to each character in the dataset.
For the names dataset, this results in 27 tokens total - one for each lowercase letter a-z, plus a special Beginning of Sequence (BOS) token. The BOS token serves as a delimiter, marking where new documents start and end. During training, each document gets wrapped with BOS tokens on both sides, allowing the model to learn document boundaries.
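A character-level tokenizer of this kind fits in a few lines. The sketch below is illustrative rather than a copy of microgpt's source; the names `encode`, `decode`, and `BOS` are chosen here for clarity:

```python
# A minimal character-level tokenizer, sketched for illustration.
docs = ["emma", "olivia", "ava"]  # stand-in for the ~32,000 names

chars = sorted(set("".join(docs)))                 # unique characters in the data
BOS = 0                                            # special Beginning of Sequence token
stoi = {ch: i + 1 for i, ch in enumerate(chars)}   # char -> token id (0 reserved for BOS)
itos = {i: ch for ch, i in stoi.items()}           # token id -> char

def encode(doc):
    # Wrap the document with BOS on both sides, as described above.
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    return "".join(itos[t] for t in tokens if t != BOS)

print(encode("ava"))           # [0, 1, 7, 1, 0] with this toy vocabulary
print(decode(encode("ava")))   # "ava"
```

On the full names dataset, the same scheme yields the 27 tokens described above: one per letter plus BOS.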
This approach contrasts sharply with production tokenizers like tiktoken, which operate on chunks of characters for efficiency. While microgpt's character-level tokenization is less efficient, it dramatically simplifies the implementation while still demonstrating the core concepts.
Autograd: Computing Gradients from Scratch
The autograd engine is perhaps the most mathematically intensive component of microgpt. It implements automatic differentiation through a custom Value class that tracks how each scalar value is computed and how changes propagate backward through the computation graph.
The Value class wraps a single scalar number and maintains several key pieces of information: the actual data value, the gradient (initialized to zero), references to its child nodes in the computation graph, and the local gradients for each operation. Every mathematical operation on Value objects creates a new Value that remembers its inputs and the local derivative of that operation.
For example, when multiplying two Value objects, the implementation records that the local gradient of a*b with respect to a is b, and with respect to b is a. During the backward pass, these local gradients are combined using the chain rule to compute the total gradient of the loss with respect to each parameter.
This implementation is functionally identical to what libraries like PyTorch provide, but operates on scalars instead of tensors. The algorithm is the same - just significantly simpler and less efficient. A concrete example demonstrates this equivalence: when computing L = a*b + a with a=2 and b=3, the gradients are dL/da = 4 and dL/db = 2, matching what PyTorch would produce.
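A stripped-down engine that reproduces this example can be written in a few dozen lines. The sketch below follows the spirit of micrograd rather than microgpt's exact Value class; the attribute and method names are ours:

```python
# A minimal scalar autograd engine, sketched for illustration.
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                  # the scalar value
        self.grad = 0.0                   # d(loss)/d(this value), filled in by backward()
        self._children = children         # Values this one was computed from
        self._local_grads = local_grads   # d(this)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # local gradient of (a + b) is 1 with respect to both a and b
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # local gradient of a*b is b with respect to a, and a with respect to b
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, local in zip(v._children, v._local_grads):
                child.grad += local * v.grad  # chain rule, accumulated

a, b = Value(2.0), Value(3.0)
L = a * b + a
L.backward()
print(a.grad, b.grad)  # 4.0 2.0, matching dL/da = b + 1 and dL/db = a
```

Note how `a` appears twice in the expression, so its gradient accumulates two contributions (3.0 from the product and 1.0 from the sum), which is exactly why gradients are added rather than assigned.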
The Neural Network Architecture
microgpt implements a GPT-2-like architecture with several simplifications for clarity: RMSNorm instead of LayerNorm, no biases, and ReLU instead of GeLU activation. The model processes one token at a time, taking as input a token ID, a position ID, cached keys and values from previous positions, and the model parameters.
The architecture follows the classic Transformer pattern of alternating attention and MLP blocks. Each token first gets embedded using learned token and position embeddings, which are then added together. The resulting vector passes through multiple layers, each containing:
Multi-head attention: The current token is projected into query, key, and value vectors. Each attention head computes dot products between its query and all cached keys, applies softmax to get attention weights, and takes a weighted sum of the cached values. The outputs of all heads are concatenated and projected through a final weight matrix.
MLP block: A two-layer feed-forward network that projects up to 4x the embedding dimension, applies ReLU, and projects back down. This is where the model does most of its per-position computation.
Residual connections: Both blocks add their output back to their input, allowing gradients to flow directly through the network and making deeper models trainable.
The final hidden state is projected to vocabulary size, producing one logit per token. Higher logits indicate the model thinks that token is more likely to come next.
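To make the attention step concrete, here is the heart of that computation for a single head, written with plain Python lists. This is an illustrative sketch, not microgpt's exact code: the function names, the single-head simplification, and the omission of normalization and the output projection are ours.

```python
import math

def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

def attention_head(x, Wq, Wk, Wv, k_cache, v_cache):
    q = matvec(Wq, x)                 # query for the current token
    k_cache.append(matvec(Wk, x))     # cache this position's key...
    v_cache.append(matvec(Wv, x))     # ...and value for future tokens
    d = len(q)
    # dot product of the query with every cached key, scaled by sqrt(d)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_cache]
    weights = softmax(scores)         # attention weights over past positions
    # weighted sum of the cached values
    return [sum(w * v[i] for w, v in zip(weights, v_cache)) for i in range(d)]

def mlp(x, W1, W2):
    h = [max(0.0, hi) for hi in matvec(W1, x)]  # project up to 4x width, ReLU
    return matvec(W2, h)                         # project back down

# Within a layer, both sub-blocks are wired with residual connections, e.g.:
#   x = [xi + yi for xi, yi in zip(x, attention_out)]
#   x = [xi + yi for xi, yi in zip(x, mlp(x, W1, W2))]
```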
Training: The Learning Process
The training loop repeatedly selects a document, tokenizes it, runs the model forward over its tokens, computes a loss, backpropagates to get gradients, and updates the parameters using the Adam optimizer.
Each training step processes one document wrapped with BOS tokens. The model predicts each next token given the tokens before it, and the loss at each position is the negative log probability of the correct next token. This cross-entropy loss measures how surprised the model is by what actually comes next - lower loss means better predictions.
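In code, the per-position loss is just a softmax over the logits followed by a negative log. A minimal sketch (illustrative, not microgpt's exact code):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    # negative log probability assigned to the correct next token
    probs = softmax(logits)
    return -math.log(probs[target])

# A model that puts high probability on the correct token is less "surprised":
print(cross_entropy([2.0, 0.1, 0.1], target=0))  # ~0.26: confident and right
print(cross_entropy([0.1, 2.0, 0.1], target=0))  # ~2.16: confident and wrong
```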
The Adam optimizer maintains two exponential moving averages per parameter: m tracks the mean of recent gradients (momentum), and v tracks the mean of recent squared gradients, which adapts the effective step size for each parameter. The learning rate decays linearly over training, and after each update the gradients are reset to zero for the next step.
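A single Adam update for one scalar parameter can be sketched as follows. The hyperparameter values shown are common defaults and the toy loss p**2 is ours, so treat this as an illustration rather than microgpt's exact settings:

```python
import math

def adam_step(p, grad, m, v, t, lr, beta1=0.9, beta2=0.95, eps=1e-8):
    # Update the running averages of the gradient and squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction: early on, m and v are biased toward their zero init.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step: large recent gradients shrink the effective step.
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = 0.5, 0.0, 0.0
for t in range(1, 4):
    grad = 2 * p                      # pretend the loss is p**2, so dL/dp = 2p
    p, m, v = adam_step(p, grad, m, v, t, lr=0.01)
print(p)  # p moves toward 0, the minimum of p**2
```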
Over 1,000 training steps, the loss decreases from around 3.3 (random guessing among 27 tokens) down to around 2.37, demonstrating that the model is learning the statistical patterns of names.
Inference: Generating New Names
Once training is complete, the model can generate new names by sampling from its learned probability distribution. Starting with the BOS token, the model produces logits for each possible next token, converts them to probabilities using softmax, and samples one token according to those probabilities. This token becomes the next input, and the process repeats until the model produces BOS again or reaches the maximum sequence length.
The temperature parameter controls the randomness of generation. Before softmax, logits are divided by the temperature. A temperature of 1.0 samples directly from the model's learned distribution. Lower temperatures (like 0.5) sharpen the distribution, making the model more conservative and likely to pick its top choices. Higher temperatures flatten the distribution and produce more diverse but potentially less coherent output.
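A minimal sampling routine with temperature might look like this (an illustrative sketch; the function name and example logits are ours):

```python
import math, random

def sample(logits, temperature=1.0):
    # Scale logits by temperature before softmax: <1 sharpens, >1 flattens.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    probs = [e / s for e in exps]
    # Draw one token id in proportion to its probability.
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
print(sample(logits, temperature=1.0))   # follows the learned distribution
print(sample(logits, temperature=0.5))   # much more likely to pick token 0
```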
From Micro to Macro: The Path to Production
While microgpt captures the algorithmic essence of GPT models, there's a vast gap between this 200-line implementation and production systems like ChatGPT. None of these differences alters the core algorithm; together, they are what make it work at scale:
Data: Production models train on trillions of tokens of internet text rather than 32,000 names. The data is carefully deduplicated, filtered for quality, and mixed across domains.
Tokenizer: Instead of character-level tokenization, production models use subword tokenizers like BPE, which learn to merge frequently co-occurring character sequences into single tokens. This gives a vocabulary of ~100K tokens and is much more efficient.
Autograd: microgpt operates on scalar Value objects in pure Python, while production systems use tensors and run on GPUs/TPUs that perform billions of floating point operations per second.
Architecture: microgpt has 4,192 parameters, while GPT-4 class models have hundreds of billions. Modern LLMs incorporate additional components like RoPE embeddings, GQA, gated linear activations, and Mixture of Experts layers.
Training: Production training uses large batches, gradient accumulation, mixed precision, and careful hyperparameter tuning. Training a frontier model takes thousands of GPUs running for months.
Optimization: At scale, optimization becomes its own discipline, with models training in reduced precision and across large GPU clusters, requiring precise tuning of learning rates, weight decay, and other hyperparameters.
Post-training: The base model that comes out of training is a document completer, not a chatbot. Turning it into ChatGPT requires supervised fine-tuning on curated conversations and reinforcement learning from human feedback.
Inference: Serving a model to millions of users requires batching requests, KV cache management, speculative decoding, quantization, and distributing the model across multiple GPUs.
Philosophical Questions
Does the model "understand" anything? Mechanically, no magic is happening. The model is a big math function that maps input tokens to a probability distribution over the next token. During training, parameters are adjusted to make the correct next token more probable. Whether this constitutes "understanding" is up to you, but the mechanism is fully contained in those 200 lines of code.
Why does it work? The model has thousands of adjustable parameters, and the optimizer nudges them a tiny bit each step to make the loss go down. Over many steps, parameters settle into values that capture the statistical regularities of the data. For names, this means things like: names often start with consonants, "qu" tends to appear together, names rarely have three consonants in a row, etc.
What's the deal with "hallucinations"? The model generates tokens by sampling from a probability distribution. It has no concept of truth, it only knows what sequences are statistically plausible given the training data. microgpt "hallucinating" a name like "karia" is the same phenomenon as ChatGPT confidently stating a false fact. Both are plausible-sounding completions that happen not to be real.
Getting Started
Running microgpt requires only Python - no pip install, no dependencies. The script takes about one minute to run on a MacBook, showing the loss decrease from ~3.3 to ~2.37 over 1,000 steps. At the end, it generates 20 new, hallucinated names that demonstrate the model has learned the statistical patterns of the training data.
The code is available as a GitHub gist and can also be run directly in a Google Colab notebook. The implementation is designed to be modified and experimented with - try different datasets, train for longer, or increase the model size to see how the results improve.
Conclusion
microgpt represents a remarkable achievement in educational software engineering. By distilling the GPT architecture to its bare essentials, it provides an accessible entry point for understanding how large language models work. The implementation demonstrates that beneath all the complexity and scale of modern AI systems lies a relatively simple core algorithm that can be expressed in just a few hundred lines of code.
For anyone interested in understanding the fundamentals of neural networks and language models, microgpt offers an unparalleled learning opportunity. It strips away the magic and reveals the mechanical reality: a series of mathematical operations that, when scaled up and trained on massive datasets, produce the seemingly intelligent behavior we associate with modern AI systems.
The project serves as both a technical achievement and a philosophical statement about the nature of AI. It shows that the "intelligence" we observe in large language models emerges from relatively simple principles applied at scale, rather than from any fundamental breakthrough in understanding intelligence itself. In this sense, microgpt is not just a teaching tool, but a window into the true nature of artificial intelligence.
