From Expression Parser to Standalone Executable: Building a Toy Language Compiler in C

What began as a simple math expression parser grew into a full compiler project, with a custom virtual machine and bytecode serialization that together produce self-contained executables. This deep dive follows the path from ASTs to a runnable binary and shows how a stack-based VM and a simple build step remove the need for external dependencies. For developers, it is a practical tour of compiler design and of the unexpected lessons that come with implementing a language.

It started innocently enough with a math expression parser called "numero", but soon Vivek N. found himself tumbling down the compiler rabbit hole. Parsing led to abstract syntax trees (ASTs), evaluation hinted at language-like behavior, and before he knew it, he was building a full toy programming language from scratch. Inspired by Thorsten Ball's books, Writing An Interpreter In Go and Writing A Compiler In Go, Vivek decided to rewrite the concepts in C for an added challenge. The result is a minimal but functional compiler that outputs self-contained executables: the VM is embedded in the binary itself, so no separately installed VM or runtime is required. As detailed in his blog post on vivekn.dev, this project offers useful insights for developers curious about compilers, virtual machines, and how executables are put together.

The Compiler Journey: From Inspiration to Implementation

Vivek's adventure began with Ball's Go-based guides, which provide a step-by-step blueprint for creating an interpreted language. But copying the code verbatim felt uninspiring. Instead, he ported the entire stack to C, embracing the language's raw control over memory and performance. The goal wasn't to build a production-ready tool—lacking closures and garbage collection, it remains a toy—but to understand the core mechanics. "It doesn't support closures. It doesn't have a garbage collector. It's definitely still a toy," Vivek admits. "But it works! And it compiles to a self-contained executable."

Inside the Virtual Machine: A Stack-Based Core

At the heart of the language is a stack-based virtual machine (VM), chosen for its simplicity over register-based alternatives. The VM handles evaluation through a tight loop, pushing and popping values while executing opcodes. Key components include:

A fixed-size stack for intermediate values (prone to overflows during development).
A constant pool for storing literals like integers and functions.
A globals store for variables.
A frame stack managing function calls, with each frame tracking the compiled function, instruction pointer, and base pointer for locals.
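
The components listed above might be laid out roughly as follows. This is a minimal sketch rather than Vivek's actual code: every type name, field name, and capacity (STACK_MAX, GLOBALS_MAX, FRAMES_MAX) is an assumption made for illustration, and values are limited to integers to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names and capacities below are illustrative assumptions. */
enum { STACK_MAX = 256, GLOBALS_MAX = 128, FRAMES_MAX = 64 };

typedef struct { int64_t as_int; } Value;   /* integers only, for brevity */

typedef struct {
    const uint8_t *code;   /* instructions of the compiled function */
    size_t length;
} CompiledFunction;

typedef struct {
    const CompiledFunction *fn;  /* function being executed */
    size_t ip;                   /* instruction pointer into fn->code */
    size_t base;                 /* stack index where this frame's locals start */
} Frame;

typedef struct {
    Value stack[STACK_MAX];      /* fixed-size evaluation stack */
    size_t sp;                   /* index of the next free slot */
    const Value *constants;      /* constant pool produced by the compiler */
    size_t constant_count;
    Value globals[GLOBALS_MAX];  /* globals store, indexed by symbol index */
    Frame frames[FRAMES_MAX];    /* one frame per active function call */
    size_t frame_count;
} VM;

/* Push with an explicit bounds check; returns false instead of overflowing. */
static bool vm_push(VM *vm, Value v) {
    if (vm->sp >= STACK_MAX) return false;
    vm->stack[vm->sp++] = v;
    return true;
}

/* Pop the top value; the caller must ensure the stack is not empty. */
static Value vm_pop(VM *vm) {
    assert(vm->sp > 0);
    return vm->stack[--vm->sp];
}
```

Checking the stack pointer in vm_push is one way to turn the overflows mentioned above into a reportable error instead of silent memory corruption.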

Here's a simplified view of the VM's execution loop, written in C-like pseudocode for clarity:

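while (running) {
    // Decode the opcode and operand at the current instruction pointer.
    Instruction instr = fetch_next_instruction();
    switch (instr.opcode) {
        case OP_ADD: {
            // Operands come off in reverse order: the right-hand side was pushed last.
            Value b = pop();
            Value a = pop();
            push(a + b);
            break;
        }
        case OP_JUMP_IF_FALSE: {
            // The condition is consumed from the stack; jump only when it is falsy.
            Value condition = pop();
            if (!is_truthy(condition)) {
                jump(instr.operand);
            }
            break;
        }
        // ... other opcodes
    }
}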

Conditional jumps pop a value from the stack and test its truthiness. Each call frame's base pointer marks where that function's locals begin on the evaluation stack, so a local is read or written by indexing relative to it. This design, though rudimentary, handles control flow and local scoping with very little machinery.
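
To make the loop concrete, here is a small runnable version of the same idea. It is a sketch under stated assumptions, not the project's real VM: the opcode set, the one-byte operand encoding, the absolute jump targets, and the rule that zero is the only falsy value are all invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode set; each operand is a single byte for simplicity. */
enum { OP_CONSTANT, OP_ADD, OP_JUMP, OP_JUMP_IF_FALSE, OP_HALT };

enum { STACK_MAX = 64 };

/* Runs `code` and returns the value on top of the stack when execution stops. */
static int64_t run(const uint8_t *code, size_t len, const int64_t *constants) {
    int64_t stack[STACK_MAX];
    size_t sp = 0;   /* next free stack slot */
    size_t ip = 0;   /* index of the next instruction */

    while (ip < len) {
        uint8_t op = code[ip++];
        switch (op) {
        case OP_CONSTANT:
            assert(sp < STACK_MAX);
            stack[sp++] = constants[code[ip++]];   /* operand = pool index */
            break;
        case OP_ADD: {
            assert(sp >= 2);
            int64_t b = stack[--sp];   /* right operand was pushed last */
            int64_t a = stack[--sp];
            stack[sp++] = a + b;
            break;
        }
        case OP_JUMP:
            ip = code[ip];             /* operand = absolute target */
            break;
        case OP_JUMP_IF_FALSE: {
            uint8_t target = code[ip++];
            assert(sp >= 1);
            if (stack[--sp] == 0) ip = target;   /* zero counts as false here */
            break;
        }
        case OP_HALT:
            return sp > 0 ? stack[sp - 1] : 0;
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
    return sp > 0 ? stack[sp - 1] : 0;
}
```

With a constant pool of {5, 10}, the byte sequence OP_CONSTANT 0, OP_CONSTANT 1, OP_ADD, OP_HALT leaves 15 on the stack.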

Compiling to Bytecode: ASTs and Symbol Tables

The compiler transforms the AST into executable bytecode in a single pass, emitting instructions into a buffer while managing scopes and constants. Each compilation scope—such as for functions—maintains its own instruction buffer, allowing nested definitions. Symbol tables use a linked structure for lexical scoping:

Global variables map to indices in the global store.
Locals use stack offsets within their frame.
Constants, including compiled functions, reside in a shared pool, enabling first-class functions.
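
The scoping rules above can be sketched with a table that points to its enclosing table. This is an illustration only: the names (SymbolTable, table_define, table_resolve) and the fixed capacity are assumptions, and the choice to reject an enclosing function's locals simply reflects that the language has no closures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { SCOPE_GLOBAL, SCOPE_LOCAL } SymbolScope;

typedef struct {
    const char *name;
    SymbolScope scope;
    int index;   /* slot in the globals store, or offset from the frame's base pointer */
} Symbol;

enum { SYMBOLS_MAX = 32 };   /* illustrative capacity */

typedef struct SymbolTable {
    struct SymbolTable *outer;   /* enclosing scope; NULL for the global table */
    Symbol symbols[SYMBOLS_MAX];
    int count;
} SymbolTable;

static void table_init(SymbolTable *t, SymbolTable *outer) {
    t->outer = outer;
    t->count = 0;
}

/* Defines `name` in `t`; returns NULL when the table is full. */
static const Symbol *table_define(SymbolTable *t, const char *name) {
    assert(name != NULL);
    if (t->count >= SYMBOLS_MAX) return NULL;
    Symbol *s = &t->symbols[t->count];
    s->name = name;
    s->scope = t->outer ? SCOPE_LOCAL : SCOPE_GLOBAL;
    s->index = t->count++;
    return s;
}

/* Looks `name` up in `t`, then walks outward through the enclosing scopes. */
static const Symbol *table_resolve(const SymbolTable *t, const char *name) {
    for (const SymbolTable *cur = t; cur != NULL; cur = cur->outer) {
        for (int i = cur->count - 1; i >= 0; i--) {
            const Symbol *s = &cur->symbols[i];
            if (strcmp(s->name, name) != 0) continue;
            /* Without closures, an enclosing function's locals are unreachable. */
            if (cur != t && s->scope == SCOPE_LOCAL) return NULL;
            return s;
        }
    }
    return NULL;
}
```

A resolved symbol tells the compiler which instruction to emit: a global access using the symbol's index, or a local access using its offset from the base pointer.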

When encountering a function literal, the compiler spawns a new scope, compiles the body into a separate instruction sequence, and packages it as a CompiledFunction object. This constant is then embedded into the main bytecode, making functions portable and executable.
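
One way to picture the per-scope instruction buffers is as a small stack of buffers, where emitting always targets the innermost one. The sketch below assumes hypothetical names (Compiler, enter_scope, leave_scope, emit) and a constant pool that holds only functions; the real compiler's structures are not shown in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { CODE_MAX = 256, SCOPES_MAX = 8, CONSTANTS_MAX = 32 };   /* illustrative limits */

typedef struct {
    uint8_t code[CODE_MAX];
    size_t length;
} CompiledFunction;

typedef struct {
    CompiledFunction scopes[SCOPES_MAX];        /* scopes[0] is the main program */
    size_t scope_index;                         /* innermost open scope */
    CompiledFunction constants[CONSTANTS_MAX];  /* function constants only, for brevity */
    size_t constant_count;
} Compiler;

static void compiler_init(Compiler *c) { memset(c, 0, sizeof *c); }

/* Appends one byte to the buffer of the innermost scope. */
static void emit(Compiler *c, uint8_t byte) {
    CompiledFunction *cur = &c->scopes[c->scope_index];
    assert(cur->length < CODE_MAX);
    cur->code[cur->length++] = byte;
}

/* Opens a fresh instruction buffer for a function body. */
static void enter_scope(Compiler *c) {
    assert(c->scope_index + 1 < SCOPES_MAX);
    c->scope_index++;
    c->scopes[c->scope_index].length = 0;
}

/* Closes the innermost scope, stores its code as a constant, and returns the pool index. */
static size_t leave_scope(Compiler *c) {
    assert(c->scope_index > 0 && c->constant_count < CONSTANTS_MAX);
    c->constants[c->constant_count] = c->scopes[c->scope_index];
    c->scope_index--;
    return c->constant_count++;
}
```

After leave_scope returns, the enclosing scope can emit an instruction that loads the function by its constant-pool index, which is how the body ends up referenced from the main bytecode.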

Crafting Self-Contained Executables: Bytecode Serialization and the Runtime Stub

The real innovation lies in generating binaries users can chmod +x and run. Vivek's approach avoids extra linker steps and custom ELF sections, instead serializing bytecode and gluing it to a precompiled runtime stub. The process unfolds in two stages:

Bytecode Serialization: The compiler outputs a ByteCode struct—containing instructions and constants—into a binary format with type tags. Strings include length prefixes, and functions encode their instruction slices.
Runtime Integration: A minimal C program (the "stub") is compiled to include deserialization logic. During build, the compiler appends the serialized bytecode to this stub, creating a single executable.
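
A tagged, length-prefixed encoding like the one described in the first stage could look something like this. The tag values, the little-endian byte order, and the field widths are assumptions chosen for the example; the post does not specify the real format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type tags; the real format's tag values are not given in the post. */
enum { TAG_INT = 1, TAG_STRING = 2 };

typedef struct {
    uint8_t *data;
    size_t length;     /* bytes written so far */
    size_t capacity;
} Buffer;

/* Appends n bytes; returns 0 if they do not fit. */
static int buf_write(Buffer *b, const void *src, size_t n) {
    assert(b->length <= b->capacity);
    if (n > b->capacity - b->length) return 0;
    memcpy(b->data + b->length, src, n);
    b->length += n;
    return 1;
}

/* Writes a 32-bit value in little-endian order so the format is host-independent. */
static int write_u32(Buffer *b, uint32_t v) {
    uint8_t bytes[4] = { (uint8_t)v, (uint8_t)(v >> 8), (uint8_t)(v >> 16), (uint8_t)(v >> 24) };
    return buf_write(b, bytes, 4);
}

/* Integer constant: tag byte followed by 8 little-endian bytes. */
static int write_int(Buffer *b, int64_t value) {
    uint64_t bits;
    uint8_t bytes[9];
    memcpy(&bits, &value, sizeof bits);
    bytes[0] = TAG_INT;
    for (int i = 0; i < 8; i++) bytes[1 + i] = (uint8_t)(bits >> (8 * i));
    return buf_write(b, bytes, 9);
}

/* String constant: tag byte, 4-byte length prefix, then the raw bytes (no terminator). */
static int write_string(Buffer *b, const char *s) {
    uint8_t tag = TAG_STRING;
    uint32_t len = (uint32_t)strlen(s);
    return buf_write(b, &tag, 1) && write_u32(b, len) && buf_write(b, s, len);
}

/* Decodes an integer constant from the start of p; returns 0 on a tag or size mismatch. */
static int read_int(const uint8_t *p, size_t n, int64_t *out) {
    uint64_t bits = 0;
    if (n < 9 || p[0] != TAG_INT) return 0;
    for (int i = 0; i < 8; i++) bits |= (uint64_t)p[1 + i] << (8 * i);
    memcpy(out, &bits, sizeof bits);
    return 1;
}
```

The length prefix lets the reader skip or copy a string without scanning for a terminator, and the leading tag tells the deserializer which decoder to call next.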

When launched, the binary reads its own end to locate the bytecode marker, deserializes it, and runs the program via the embedded VM. Vivek notes the hacky elegance: "It’s just pure C and a bit of file manipulation." For instance, a sample MonkeyC program:

let x = 5;
println(x + 10);

It compiles with monkeyc -o program.mon and is then built into an executable. The result is a single runnable binary with no external dependencies.
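
The "append, then read your own end" trick can be sketched with plain stdio calls. The trailer layout here (payload, then an 8-byte little-endian length, then an 8-byte marker) and the marker text are assumptions made for this example, not the format the project actually uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical trailer layout: [stub][payload][8-byte LE length][8-byte marker]. */
static const char MAGIC[8] = { 'M', 'O', 'N', 'K', 'B', 'C', '0', '1' };
enum { TRAILER_SIZE = 16 };

/* Appends the payload and trailer at the end of the file. Returns 1 on success. */
static int append_payload(FILE *f, const void *payload, uint64_t n) {
    uint8_t len_bytes[8];
    for (int i = 0; i < 8; i++) len_bytes[i] = (uint8_t)(n >> (8 * i));
    if (fseek(f, 0, SEEK_END) != 0) return 0;
    return fwrite(payload, 1, (size_t)n, f) == n
        && fwrite(len_bytes, 1, 8, f) == 8
        && fwrite(MAGIC, 1, 8, f) == 8;
}

/* Reads the trailer from the end of the file and reports where the payload starts. */
static int find_payload(FILE *f, long *offset, uint64_t *length) {
    uint8_t trailer[TRAILER_SIZE];
    uint64_t n = 0;
    long size;
    if (fseek(f, 0, SEEK_END) != 0) return 0;
    size = ftell(f);
    if (size < TRAILER_SIZE) return 0;
    if (fseek(f, size - TRAILER_SIZE, SEEK_SET) != 0) return 0;
    if (fread(trailer, 1, TRAILER_SIZE, f) != TRAILER_SIZE) return 0;
    if (memcmp(trailer + 8, MAGIC, 8) != 0) return 0;      /* no embedded bytecode */
    for (int i = 0; i < 8; i++) n |= (uint64_t)trailer[i] << (8 * i);
    if (n > (uint64_t)(size - TRAILER_SIZE)) return 0;     /* corrupt length */
    *length = n;
    *offset = size - TRAILER_SIZE - (long)n;
    return 1;
}
```

In a real stub, the file handle would come from opening the running executable itself. How the project locates its own path is not described in the post; on Linux, opening /proc/self/exe is one common option.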

Reflections and the Path Forward

This project, while limited, underscores how building compilers demystifies complex systems. Vivek gained unexpected lessons in memory management, optimization trade-offs, and the realities of binary formats. "It taught me more about compilers, memory, and language design than I ever expected," he reflects. For developers, it's a reminder that even toy projects can yield deep insights—and sometimes, the journey from a simple parser to a self-contained executable is where true engineering artistry begins.
