
Remember when in-memory databases like Redis felt revolutionary? Meet WILD (Within-cache Incredibly Lightweight Database), which laughs at RAM-based solutions by storing everything in CPU L3 cache. This audacious experiment achieves 3-nanosecond read latencies – roughly the time light takes to travel a meter – by treating modern CPUs' multi-megabyte caches as primary storage rather than transient buffers.

Why CPU Cache as Storage?

Modern CPUs contain hierarchical memory structures where speed inversely scales with capacity:

| Memory Tier | Capacity Range | Access Latency |
|-------------|----------------|----------------|
| L1 Cache    | 32-64KB        | ~1ns           |
| L2 Cache    | 256KB-1MB      | 3-10ns         |
| L3 Cache    | 8-144MB        | 15-50ns        |
| RAM         | 8-128GB        | 100-300ns      |
| SSD/HDD     | 1TB+           | µs-ms range    |

WILD exploits modern CPUs' sprawling L3 caches (up to 144MB in flagship processors) as primary storage. As creator Canoozie notes: "Your CPU's cache is basically a tiny, incredibly fast SSD that no one bothered to format." This approach enables:
- Sub-microsecond operations: Database reads complete before kernel scheduler interrupts fire
- Zero-copy data access: Eliminating serialization/deserialization overhead
- NUMA-aware placement: Minimizing cross-socket memory latency penalties

Engineering for the Cache Hierarchy

WILD's architecture reflects deep understanding of CPU internals:

Cache-Line-Optimized Records
Each record fits precisely in a 64-byte cache line, aligned to prevent false sharing:

```zig
// Zig implementation of a cache-line-sized record. Fields are ordered
// largest-first so the extern (C ABI) layout has no padding:
// 8 + 4 + 52 = 64 bytes, exactly one cache line.
pub const CacheLineRecord = extern struct {
    key: u64,       // Hash key
    metadata: u32,  // Valid flag + length
    data: [52]u8,   // Payload
};

comptime {
    if (@sizeOf(CacheLineRecord) != 64) @compileError("record must be 64 bytes");
}
```
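The zero-copy claim rests on this layout: a record is plain old data in a fixed 64-byte slot, so a reader can reinterpret the bytes in place instead of deserializing them. A minimal C sketch of the same idea (the struct mirrors the Zig layout; `demo_roundtrip` is an illustrative name, not WILD's API):

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

// Mirror of the Zig record: 8 + 4 + 52 = 64 bytes, no padding.
typedef struct {
    uint64_t key;       // hash key
    uint32_t metadata;  // valid flag + length
    uint8_t  data[52];  // payload
} CacheLineRecord;

static_assert(sizeof(CacheLineRecord) == 64, "record must fill one cache line");

// Write a record into an aligned 64-byte slot, then read it back
// through a reinterpreting pointer -- no serialization step.
uint64_t demo_roundtrip(void) {
    // 64-byte alignment guarantees the record never straddles two
    // cache lines, which would double the fetch cost.
    alignas(64) uint8_t slot[64] = {0};
    CacheLineRecord *rec = (CacheLineRecord *)slot;
    rec->key = 51966;       // writer stores in place
    rec->metadata = 1;
    const CacheLineRecord *view = (const CacheLineRecord *)slot;
    return view->key;       // reader sees the same bytes, zero copies
}
```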

NUMA and SMT Consciousness
WILD probes /sys/devices/system/cpu at startup to map cache domains and NUMA nodes. Data structures bind to specific cores, avoiding costly cross-NUMA access. It prioritizes physical cores over hyperthreads to prevent execution unit contention.
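The topology files under `/sys/devices/system/cpu` (for example `cpu0/topology/thread_siblings_list`) encode CPU sets in the kernel's "cpulist" format, such as `0-3,8-11`. A minimal C parser for that format, assuming well-formed input – which sysfs paths WILD actually consults is not documented here, so treat the example path as illustrative:

```c
#include <stddef.h>
#include <stdlib.h>

// Parse a Linux "cpulist" string such as "0-3,8-11" into an array of
// CPU ids. Returns the number of ids written (at most `max`).
size_t parse_cpulist(const char *s, int *out, size_t max) {
    size_t n = 0;
    while (*s && n < max) {
        char *end;
        long lo = strtol(s, &end, 10);
        long hi = lo;
        if (*end == '-')                    // a range like "8-11"
            hi = strtol(end + 1, &end, 10);
        for (long cpu = lo; cpu <= hi && n < max; cpu++)
            out[n++] = (int)cpu;
        if (*end != ',')                    // end of the list
            break;
        s = end + 1;
    }
    return n;
}
```

Once the sibling lists are parsed, a scheduler can skip any CPU id that shares a physical core with one already in use, which is how hyperthread avoidance falls out of this data.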

Flat Hash Table Design
A linear-probing hash table with power-of-two capacity enables bitmask indexing (3 cycles vs. 30+ for modulo). Wyhash and MurmurHash3 minimize collisions while maintaining cache locality.
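The bitmask trick works because for a power-of-two capacity `c`, `hash % c == hash & (c - 1)`, replacing a ~30-cycle integer division with a single AND. A toy linear-probing table illustrating the indexing – WILD uses Wyhash/MurmurHash3, so the mixer below is a simple stand-in, and this sketch assumes the table never fills completely:

```c
#include <stdbool.h>
#include <stdint.h>

#define CAP  1024u        // power of two, so CAP - 1 is an all-ones mask
#define MASK (CAP - 1u)

static uint64_t keys[CAP];
static uint64_t vals[CAP];
static bool     used[CAP];

// Stand-in 64-bit mixer (WILD's tables use Wyhash / MurmurHash3).
static uint64_t mix(uint64_t x) {
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    return x ^ (x >> 33);
}

void put(uint64_t key, uint64_t val) {
    uint64_t i = mix(key) & MASK;      // bitmask, not modulo
    while (used[i] && keys[i] != key)
        i = (i + 1) & MASK;            // linear probe: adjacent slots stay cache-hot
    keys[i] = key; vals[i] = val; used[i] = true;
}

bool get(uint64_t key, uint64_t *val) {
    uint64_t i = mix(key) & MASK;
    while (used[i]) {
        if (keys[i] == key) { *val = vals[i]; return true; }
        i = (i + 1) & MASK;
    }
    return false;
}
```

Linear probing pairs well with the bitmask: collisions scan consecutive slots, so the probe sequence stays within the cache lines already fetched.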

The Brutal Tradeoffs

WILD's performance comes at significant cost:
- Zero durability: Data evaporates on eviction or shutdown
- No ACID/transactions: Crash consistency? Forget it
- Capacity ceiling: Limited by L3 cache size (e.g., ~1.5M 64B records in 96MB)
- x86_64/Linux only: No ARM or Windows support
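The capacity ceiling is simple arithmetic – the cache budget divided by the 64-byte record size:

```c
#include <stdint.h>

// Maximum record count for a given L3 budget at 64 bytes per record.
uint64_t max_records(uint64_t cache_bytes) {
    return cache_bytes / 64;
}
```

For a 96MB L3, `max_records(96ULL * 1024 * 1024)` gives 1,572,864 records – the ~1.5M figure quoted above, before accounting for hash-table load factor or metadata.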

Why This Matters Beyond the Novelty

While impractical for most workloads, WILD demonstrates techniques applicable to serious systems:
1. Database internals: Buffer pools and write-ahead logs benefit from cache-aware structures
2. Real-time systems: HFT and industrial control could adopt these patterns selectively
3. Query optimization: query engines can size operators so intermediate results stay L3-resident

Benchmarks on a Ryzen 9 7800X3D (96MB L3) show 615.5 million operations/second – orders of magnitude beyond traditional databases. As one commenter observed: "This feels like seeing a Formula 1 car: impractical for groceries, but revolutionary engineering that trickles down to everyday vehicles."

The real value lies in WILD's stark reminder: we leave enormous performance untapped by treating CPU caches as abstracted implementation details. While you shouldn't deploy this to production, its lessons could reshape how we design systems pushing latency boundaries.