WILD: The Database That Lives Entirely in Your CPU's L3 Cache
Remember when in-memory databases like Redis felt revolutionary? Meet WILD (Within-cache Incredibly Lightweight Database), which laughs at RAM-based solutions by storing everything in the CPU's L3 cache. This audacious experiment achieves 3-nanosecond read latencies, roughly the time light takes to travel a meter, by treating modern CPUs' multi-megabyte caches as primary storage rather than transient buffers.
Why CPU Cache as Storage?
Modern CPUs contain hierarchical memory structures where speed inversely scales with capacity:
| Memory Tier | Capacity Range | Access Latency |
|---|---|---|
| L1 Cache | 32-64KB | ~1ns |
| L2 Cache | 256KB-1MB | 3-10ns |
| L3 Cache | 8-144MB | 15-50ns |
| RAM | 8-128GB | 100-300ns |
| SSD/HDD | 1TB+ | µs-ms range |
WILD exploits modern CPUs' sprawling L3 caches (up to 144MB in flagship processors) as persistent storage. As creator Canoozie notes: "Your CPU's cache is basically a tiny, incredibly fast SSD that no one bothered to format." This approach enables:
- Sub-microsecond operations: Database reads complete before kernel scheduler interrupts fire
- Zero-copy data access: Eliminating serialization/deserialization overhead
- NUMA-aware placement: Minimizing cross-socket memory latency penalties
Engineering for the Cache Hierarchy
WILD's architecture reflects deep understanding of CPU internals:
Cache-Line-Optimized Records
Each record fits precisely in a 64-byte cache line, aligned to prevent false sharing:
```zig
// Zig implementation of a cache-line-sized record (extern = C layout).
// The u64 key comes first: placing the u32 first would force 4 bytes of
// alignment padding before the key and push the struct to 72 bytes.
const std = @import("std");

pub const CacheLineRecord = extern struct {
    key: u64,      // hash key (naturally 8-byte aligned at offset 0)
    metadata: u32, // valid flag + payload length
    data: [52]u8,  // payload
}; // 8 + 4 + 52 = 64 bytes: exactly one cache line, no padding

comptime {
    std.debug.assert(@sizeOf(CacheLineRecord) == 64);
}
```
NUMA and SMT Consciousness
WILD probes /sys/devices/system/cpu at startup to map cache domains and NUMA nodes. Data structures bind to specific cores, avoiding costly cross-NUMA access. It prioritizes physical cores over hyperthreads to prevent execution unit contention.
Flat Hash Table Design
A linear-probing hash table with power-of-two capacity enables bitmask indexing (3 cycles vs. 30+ for modulo). Wyhash and MurmurHash3 keep collisions rare, while linear probing keeps each lookup walking adjacent cache lines instead of jumping across the table.
The Brutal Tradeoffs
WILD's performance comes at significant cost:
- Zero durability: Data evaporates on eviction or shutdown
- No ACID/transactions: Crash consistency? Forget it
- Capacity ceiling: Limited by L3 cache size (e.g., ~1.5M 64B records in 96MB)
- x86_64/Linux only: No ARM or Windows support
Why This Matters Beyond the Novelty
While impractical for most workloads, WILD demonstrates techniques applicable to serious systems:
1. Database internals: Buffer pools and write-ahead logs benefit from cache-aware structures
2. Real-time systems: HFT and industrial control could adopt these patterns selectively
3. Query optimization: Intermediate results can be sized and laid out to stay L3-resident
Benchmarks on a Ryzen 7 7800X3D (96MB L3) show 615.5 million operations/second, roughly 1.6ns per operation amortized across cores and orders of magnitude beyond traditional databases. As one commenter observed: "This feels like seeing a Formula 1 car: impractical for groceries, but revolutionary engineering that trickles down to everyday vehicles."
The real value lies in WILD's stark reminder: we leave enormous performance untapped by treating CPU caches as abstracted implementation details. While you shouldn't deploy this to production, its lessons could reshape how we design systems pushing latency boundaries.