The Elegant Simplicity of smalloc: A 350-Line Memory Allocator That Competes with Giants

Tech Essays Reporter

A new Rust-based memory allocator called smalloc demonstrates that minimalist design can rival complex, production-tested systems like jemalloc and mimalloc, achieving comparable or better performance through radical simplicity and a novel approach to virtual memory management.

In the landscape of systems programming, memory allocation is a foundational concern that has traditionally been addressed with increasing layers of complexity. The dominant allocators in production environments—glibc's ptmalloc2, jemalloc, mimalloc, snmalloc, and rpmalloc—are marvels of engineering, often comprising thousands of lines of code to handle intricate scenarios like thread-local caching, size-class optimization, and security hardening. Yet, a new project named smalloc presents a compelling counter-narrative: what if the most effective solution is not more complexity, but radical simplicity?

Developed by Zooko Wilcox-O'Hearn, smalloc is a memory allocator written in Rust that spans a mere 350 lines of implementation code. This is not a toy or a proof-of-concept; it is a serious contender designed as a drop-in replacement for the aforementioned allocators. Its performance benchmarks, particularly against the simd-json library and in high-thread-count scenarios, show it to be competitive with, and in many cases significantly faster than, its established rivals. The core thesis of smalloc is that by embracing the near-limitless resource of virtual address space and avoiding the traditional goal of minimizing its reservation, one can achieve both simplicity and efficiency.

The Core Philosophy: Virtual Address Space is Free

The foundational insight of smalloc is that reserving virtual memory address space is a cheap operation. Modern operating systems provide vast virtual address spaces (e.g., 48 bits of usable addresses on typical 64-bit hardware), and merely reserving a range of addresses—telling the kernel "I might use this later"—consumes negligible resources compared to the act of actually touching (reading or writing to) that memory, which triggers page faults, TLB pressure, and potential swap activity. Traditional allocators often treat virtual address space as a scarce resource to be meticulously managed, leading to complex data structures for tracking free regions.

smalloc inverts this logic. It pre-reserves a huge, contiguous swathe of virtual address space at startup. Within this reserved space, it lays out a sparse, predictable structure of "slabs" and "slots" without touching any physical memory. This sparseness is the key that unlocks its simplicity. Because the layout is predetermined and sparse, finding a free slot of the required size becomes a matter of calculation rather than search. The allocator doesn't need to scan complex data structures to find a fit; it computes a location directly.
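To make "calculation rather than search" concrete, here is a hedged sketch of how a fixed, precomputed layout turns slot lookup into pure arithmetic over (size class, slab, slot). The constants `SLOTS_PER_SLAB` and the class-0 slot size of 4 bytes are illustrative assumptions for this example, not smalloc's actual layout; only the 64-slabs-per-class figure comes from the article.

```rust
// Illustrative sketch: with a predetermined sparse layout, a slot's
// offset within the reserved region is computed, never searched for.
// Layout constants below are assumptions for demonstration only.

const NUM_SLABS: u64 = 64; // slabs per size class (per the article)
const SLOTS_PER_SLAB: u64 = 1 << 20; // assumed slot count per slab

/// Slot size for a size class: doubling per class (4, 8, 16, ... bytes).
fn slot_size(size_class: u32) -> u64 {
    4u64 << size_class
}

/// Bytes occupied by all slabs of all classes below `size_class`.
fn class_base(size_class: u32) -> u64 {
    (0..size_class)
        .map(|c| slot_size(c) * SLOTS_PER_SLAB * NUM_SLABS)
        .sum()
}

/// Offset of a slot from the start of the reserved region:
/// skip lower classes, then earlier slabs, then earlier slots.
fn slot_offset(size_class: u32, slab: u64, slot: u64) -> u64 {
    class_base(size_class)
        + slab * slot_size(size_class) * SLOTS_PER_SLAB
        + slot * slot_size(size_class)
}

fn main() {
    // Size class 2 holds 16-byte slots, so slot 3 of slab 0 sits
    // 48 bytes past the start of that class's region.
    assert_eq!(slot_offset(2, 0, 3) - class_base(2), 48);
    println!("offset = {}", slot_offset(2, 0, 3));
}
```

Because every quantity on the right-hand side is a compile-time-known constant or an input, the conversion between a pointer and its (size class, slab, slot) triple works in both directions with no metadata loads.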

Data Model: Slots, Slabs, and Size Classes

All memory managed by smalloc is organized into fixed-size "slabs," each containing an array of fixed-length "slots." Each slot belongs to a specific "size class," where each class's slot size is double the previous (e.g., 4 bytes, 8 bytes, 16 bytes, etc.). For each size class, smalloc maintains 64 separate slabs. This multiplicity of slabs per size class is a critical performance optimization, allowing threads to operate on different slabs to minimize contention.

The genius of the design lies in how it manages free slots. Instead of maintaining separate metadata structures, smalloc uses an intrusive free list. When a slot is free, the memory that would otherwise hold user data is repurposed to store the slot number of the next free slot in the same slab. The head of this free list for each slab is stored in a single variable called the "free-list head" (flh). This means the entire state of the allocator for a given slab can be described by just the flh and the linked list embedded within the free slots themselves.
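The intrusive free list is easiest to see in a simplified, single-threaded model. In the sketch below, slot memory is modeled as a `Vec` of words and the end-of-list sentinel is an assumption of this example; real smalloc stores the next-slot number inside the freed slot's own bytes and updates the flh atomically.

```rust
// Simplified single-threaded model of an intrusive free list:
// each free slot's memory holds the index of the next free slot,
// and the flh holds the index of the first free slot.

const SENTINEL: u32 = u32::MAX; // end-of-list marker (assumed)

struct Slab {
    slots: Vec<u32>, // stand-in for slot memory; holds next-free indices
    flh: u32,        // free-list head
}

impl Slab {
    fn new(n: u32) -> Self {
        // Initially every slot is free and links to its successor.
        let slots = (0..n)
            .map(|i| if i + 1 < n { i + 1 } else { SENTINEL })
            .collect();
        Slab { slots, flh: 0 }
    }

    fn alloc(&mut self) -> Option<u32> {
        if self.flh == SENTINEL {
            return None; // slab exhausted
        }
        let slot = self.flh;
        self.flh = self.slots[slot as usize]; // pop: advance head
        Some(slot)
    }

    fn free(&mut self, slot: u32) {
        self.slots[slot as usize] = self.flh; // link slot to old head
        self.flh = slot;                      // push: slot becomes head
    }
}

fn main() {
    let mut slab = Slab::new(4);
    assert_eq!(slab.alloc(), Some(0));
    assert_eq!(slab.alloc(), Some(1));
    slab.free(0);
    assert_eq!(slab.alloc(), Some(0)); // most recently freed comes back first
    assert_eq!(slab.alloc(), Some(2));
}
```

Note that the only dedicated state is the flh itself; everything else lives in memory that would otherwise sit idle while a slot is free.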

Algorithms and Thread Safety

The algorithms for malloc, free, and realloc are strikingly straightforward, built upon this simple data model.

  • Allocation (malloc): To allocate, smalloc calculates the appropriate size class and selects a slab (typically the one assigned to the current thread). It then atomically pops the first entry from the slab's free list using a compare-and-exchange loop on the flh. The popped slot is returned to the caller.
  • Deallocation (free): To free, smalloc reverses the process. It calculates the slab and slot number from the pointer, reads the current flh, writes that value into the freed slot (linking it to the previous head), and atomically updates the flh to point to the newly freed slot.
  • Reallocation (realloc): For growing allocations, smalloc employs a heuristic to minimize copying. If the new size fits in the current slot, it does nothing. If the new size is small enough to pack multiple allocations into a page, it allocates a new slot of that size. For very large allocations (e.g., >4 MiB), it allocates an oversized slot to provide ample room for future growth, reducing the need for subsequent relocations.
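The malloc fast path described above can be sketched as a lock-free pop using a compare-exchange loop. This is an illustrative reconstruction, not smalloc's code: slot memory is modeled as a slice of atomics, the sentinel value is assumed, and the ABA counter discussed under thread safety is deliberately omitted for clarity.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Sketch of the lock-free allocation fast path: atomically pop the
// head of a slab's free list. (ABA protection omitted for clarity.)

const SENTINEL: u32 = u32::MAX; // end-of-list marker (assumed)

fn pop_slot(flh: &AtomicU32, next: &[AtomicU32]) -> Option<u32> {
    loop {
        let head = flh.load(Ordering::Acquire);
        if head == SENTINEL {
            return None; // slab exhausted: no free slots
        }
        // Read the next-free index out of the free slot's own memory.
        let new_head = next[head as usize].load(Ordering::Relaxed);
        // Try to swing the head forward; retry if another thread raced us.
        if flh
            .compare_exchange_weak(head, new_head, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return Some(head);
        }
    }
}

fn main() {
    // A two-slot slab: slot 0 links to slot 1, slot 1 ends the list.
    let flh = AtomicU32::new(0);
    let next = [AtomicU32::new(1), AtomicU32::new(SENTINEL)];
    assert_eq!(pop_slot(&flh, &next), Some(0));
    assert_eq!(pop_slot(&flh, &next), Some(1));
    assert_eq!(pop_slot(&flh, &next), None);
}
```

The free path is the mirror image: link the slot to the current head, then compare-exchange the flh to point at it.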

Thread safety is achieved without locks. The only shared state requiring synchronization is the flh for each slab. smalloc uses atomic compare-and-exchange operations to update the flh, ensuring progress even under contention. To prevent ABA problems (where a slot is popped, freed, and pushed back before another thread's operation completes), a counter embedded in the high bits of the flh is incremented on each push attempt, so a stale compare-and-exchange fails even if the same slot number has reappeared at the head.
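A minimal sketch of that ABA guard, under assumed field widths (32-bit counter in the high half, 32-bit slot index in the low half; smalloc's actual bit layout may differ), looks like this:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of ABA protection on the free-list push: the flh word packs
// a generation counter (high bits) with the head slot index (low bits).
// Bumping the counter on every push makes any stale CAS fail, even if
// the same slot index has cycled back to the head in the meantime.

const SENTINEL: u32 = u32::MAX; // end-of-list marker (assumed)

fn pack(counter: u32, slot: u32) -> u64 {
    ((counter as u64) << 32) | slot as u64
}

fn unpack(word: u64) -> (u32, u32) {
    ((word >> 32) as u32, word as u32)
}

fn push_slot(flh: &AtomicU64, next: &[AtomicU64], slot: u32) {
    loop {
        let old = flh.load(Ordering::Acquire);
        let (counter, old_head) = unpack(old);
        // Link the freed slot to the current head (in the slot's memory).
        next[slot as usize].store(old_head as u64, Ordering::Relaxed);
        // New head = freed slot, with the generation counter bumped.
        let new = pack(counter.wrapping_add(1), slot);
        if flh
            .compare_exchange_weak(old, new, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return;
        }
    }
}

fn main() {
    // Empty free list: counter 0, head = SENTINEL.
    let flh = AtomicU64::new(pack(0, SENTINEL));
    let next: Vec<AtomicU64> = (0..8).map(|_| AtomicU64::new(0)).collect();
    push_slot(&flh, &next, 5);
    let (counter, head) = unpack(flh.load(Ordering::Relaxed));
    assert_eq!((counter, head), (1, 5)); // counter bumped, slot 5 is head
}
```

A pop that loaded the old packed word before an intervening free will fail its compare-exchange on the counter mismatch and simply retry.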

Performance and Trade-offs

The performance claims are substantiated by benchmarks. In a high-thread-count test (128 threads) from smalloc's own benchmark suite, it showed a 99% improvement over jemalloc and a 70% improvement over mimalloc. In the simd-json benchmarks, smalloc consistently ranked at or near the top, with normalized work completion times showing a 19.6% improvement over the baseline allocator.

This performance stems from several factors:

  1. Minimal Code Paths: With only 350 lines, the code executed for each allocation/deallocation is tiny, reducing instruction cache pressure.
  2. Fewer Cache Misses: The sparse, pre-computed layout means the allocator needs to load very little data from memory. The common case for malloc involves at most three potential cache misses (thread-local slab number, slab's flh, and the intrusive free list entry).
  3. No Lock Contention: The lock-free design avoids the latency spikes and priority inversion issues that can plague lock-based allocators under high contention.

However, smalloc makes deliberate trade-offs that limit its applicability:

  • Allocation Limits: It cannot allocate more than 2 GiB in a single call. It also has a complex, tiered limit on the number of simultaneous large allocations (e.g., only 192 allocations >1 GiB). If these limits are reached, malloc returns null immediately, rather than falling back to a system allocator. This prioritizes consistent, predictable failure modes over graceful degradation.
  • Single Instance: Only one instance of smalloc can exist per process.
  • No Security Hardening: The project explicitly states it has no features for hardening against memory exploitation, making it unsuitable for security-critical applications where allocators like hardened malloc are required.

Implications and Future Directions

smalloc challenges the assumption that production-grade allocators must be monolithic and complex. It demonstrates that a clear, principled design focused on a specific set of goals—simplicity, consistent performance, and cache efficiency—can yield a tool that is both elegant and highly effective for its target use cases. Its primary target is Rust applications, where predictable performance and low overhead are valued.

The project's roadmap includes practical ports (to Windows, iOS, Android) and more experimental ones (to Zig, Odin, or even WebAssembly). There's also interest in integrating with debugging tools like Valgrind and exploring alternative strategies like FIFO free lists or redzones for debugging. Each of these directions would add complexity, and the maintainer's challenge will be to weigh these enhancements against the core value of simplicity.

In the end, smalloc serves as a refreshing reminder in systems engineering: sometimes, the most powerful lever is not to add more features, but to question the foundational constraints we assume. By treating virtual address space as abundant rather than scarce, smalloc opens a path to an allocator that is not just fast, but also a pleasure to read, understand, and maintain.
