Google's Temeraire: How a Hugepage-Aware Allocator Rewrites Memory Efficiency at Scale
Imagine trimming 5% off your cloud bill while accelerating applications—without rewriting a single line of application code. That’s the promise of Temeraire, Google’s hugepage-aware memory allocator integrated into tcmalloc. Born from research presented at OSDI ’21, Temeraire isn’t just another allocator tweak. It’s a masterclass in systemic optimization, revealing how Google tackles fleet-wide efficiency by confronting the Translation Lookaside Buffer (TLB) bottleneck head-on.
The Hugepage Imperative
Modern systems rely on virtual memory, where the TLB caches mappings between virtual and physical addresses. Each 4 KiB page translation consumes a TLB entry. Hugepages (typically 2 MiB) reduce this overhead dramatically: one entry covers 512x more memory. Linux's Transparent Hugepages (THP) automates this, but allocators often sabotage it. As the paper notes:
"If you have a single 8-byte allocation on a 2 MiB hugepage, you waste 99.999% of memory."
Traditional allocators prioritize minimizing fragmentation within individual pages. Temeraire flips this: it optimizes for hugepage coverage, ensuring allocations align to maximize THP usage. The payoff? Fewer TLB misses, higher Instructions Per Cycle (IPC), and tangible throughput gains.
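To make the arithmetic concrete, here is a minimal sketch of "TLB reach," the amount of memory addressable without a page-table walk. The 1536-entry TLB size is a hypothetical stand-in; real sizes vary by microarchitecture:

    #include <cstdio>

    int main() {
      // Assumed x86-64 page sizes; hugepage size varies by platform.
      constexpr long kBasePage = 4 * 1024;         // 4 KiB
      constexpr long kHugePage = 2 * 1024 * 1024;  // 2 MiB
      constexpr int kTlbEntries = 1536;            // hypothetical L2 TLB size

      // "TLB reach": memory covered by cached translations.
      printf("4 KiB pages: %ld MiB of reach\n",
             kTlbEntries * kBasePage / (1024 * 1024));  // 6 MiB
      printf("2 MiB pages: %ld MiB of reach\n",
             kTlbEntries * kHugePage / (1024 * 1024));  // 3072 MiB
      printf("coverage ratio: %ldx\n", kHugePage / kBasePage);  // 512x
      return 0;
    }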
Profiling the Fleet: Beyond Microbenchmarks
Temeraire’s brilliance lies not just in design but in validation. Allocator microbenchmarks are notoriously misleading—optimizations that look regressive locally might win globally. Google sidestepped this with Google-Wide Profiling (GWP):
- Continuously samples CPU metrics (e.g., TLB misses, IPC) across all production workloads.
- Enables measuring real-world impact, not proxy metrics.
- Revealed that Temeraire’s "costly" allocation heuristics improved fleet performance despite microbenchmark penalties.
As the authors argue:
"Microbenchmark results alone can’t be trusted... GWP allowed us to measure how changes affect actual production performance."
This tooling transformed development. Engineers used an "empirical driver"—a synthetic workload generator based on fleet telemetry—for rapid iteration. Promising candidates graduated to full A/B tests in production via a custom framework, confirming gains like 5% higher requests/second and 8% lower RAM usage across diverse services.
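A toy version of such a driver might look like the following: replay a size distribution against malloc while capping the live set. This is a sketch only; the size classes, weights, and live-set cap are illustrative placeholders, not Google's actual telemetry-derived parameters:

    #include <cstdlib>
    #include <deque>
    #include <random>
    #include <vector>

    // Replays a size distribution (here: a made-up discrete distribution
    // over four size classes) the way fleet telemetry might drive it.
    int main() {
      std::mt19937 rng(42);
      const std::vector<size_t> sizes = {8, 64, 4096, 1 << 20};  // bytes
      std::discrete_distribution<int> pick({70, 20, 9, 1});      // weights

      std::deque<void*> live;
      for (long i = 0; i < 10'000'000; ++i) {
        live.push_back(malloc(sizes[pick(rng)]));
        if (live.size() > 100'000) {  // cap the live set; free FIFO
          free(live.front());
          live.pop_front();
        }
      }
      for (void* p : live) free(p);
      return 0;
    }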
Temeraire’s Architecture: Heuristics for the Real World
Temeraire’s core challenge: minimize wasted space while maximizing hugepage eligibility. Its four components tackle this via smart heuristics:
- HugeFiller: Manages sub-1 MiB allocations with a best-fit strategy (sketched after this list) that prioritizes hugepages with:
  - Smallest longest-free range (preserving long free runs elsewhere for bigger requests).
  - Most allocations (per the "radioactive decay" model: pages crowded with many small objects tend to empty sooner).
- HugeCache: Handles mid-sized allocations (1 MiB–1 GiB).
- HugeAllocator: Fetches hugepage-aligned virtual address ranges from the OS.
- HugeRegion: Mitigates "slack" (the unused tail left when a request is rounded up to whole hugepages) for pathological 1–2 MiB allocation patterns.
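As a rough illustration of the HugeFiller ranking above, consider a comparator over per-hugepage statistics. This is a sketch under assumptions: the HugepageStats type and its fields are hypothetical, and the real implementation tracks free runs with per-hugepage bitmaps rather than a comparator like this:

    // Hypothetical per-hugepage bookkeeping.
    struct HugepageStats {
      int longest_free_range;  // longest run of free allocator pages
      int num_allocations;     // live allocations on this hugepage
    };

    // Returns true if hugepage a is a better fill target than b:
    // prefer the smallest longest-free range that still fits the
    // request (saving long runs for bigger requests), then the most
    // allocations, since densely packed pages tend to empty sooner
    // under the radioactive-decay model of object lifetimes.
    bool BetterFillTarget(const HugepageStats& a, const HugepageStats& b,
                          int needed_pages) {
      const bool a_fits = a.longest_free_range >= needed_pages;
      const bool b_fits = b.longest_free_range >= needed_pages;
      if (a_fits != b_fits) return a_fits;
      if (a.longest_free_range != b.longest_free_range)
        return a.longest_free_range < b.longest_free_range;
      return a.num_allocations > b.num_allocations;
    }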
Pseudocode for allocation decisions:
Span New(N) {
  // Huge requests: sub-hugepage slack is negligible at this size.
  if (N >= 1 GiB) return HugeCache.New(N);
  // Small requests: pack densely into partially filled hugepages.
  if (N <= 1 MiB) return HugeFiller.New(N);
  // Fits in a single hugepage: try the filler's existing gaps first.
  if (N < 2 MiB) {
    Span s = HugeFiller.TryAllocate(N);
    if (s != NULL) return s;
  }
  // Medium requests: pack together to bound slack.
  Span s = HugeRegion.TryAllocate(N);
  if (s != NULL) return s;
  // Fall back to whole hugepages from the cache.
  s = HugeCache.New(N);
  HugeFiller.DonateTail(s); // Repurpose leftover space
  return s;
}
The Telemetry Flywheel: Lessons Beyond Allocation
Temeraire’s success underscores a broader truth: system software thrives on observability. The telemetry pipeline built for it—sampling allocations/free patterns across the fleet—became a "powerful tool for designing allocators," revealing optimization opportunities invisible otherwise. This reinforces a critical engineering maxim: invest in instrumentation early, and it compounds in value.
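Such pipelines typically rest on sampled allocations rather than full traces, often sampling by bytes allocated so that overhead stays bounded while large and frequent allocations still surface. A minimal sketch of that idea; the interval, hook name, and output are illustrative, not tcmalloc's internals:

    #include <cstddef>
    #include <cstdio>

    // Hypothetical sampler: record roughly one allocation per
    // kSampleInterval bytes allocated across all requests.
    constexpr size_t kSampleInterval = 512 * 1024;  // illustrative
    size_t bytes_until_sample = kSampleInterval;

    void MaybeSample(size_t size) {
      if (size < bytes_until_sample) {
        bytes_until_sample -= size;
        return;
      }
      bytes_until_sample = kSampleInterval;
      fprintf(stderr, "sampled alloc: %zu bytes\n", size);  // ship to telemetry
    }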
While Temeraire delivered impressive results (50% fewer page-table walks, 5% RPS gains), its legacy is methodological. By measuring fleet efficiency—not just allocator speed—and embracing production experimentation, Google turned memory allocation into a vector for datacenter-wide wins. For engineers, the takeaway is clear: sometimes, the biggest optimizations come from seeing the forest, not just the trees.
Source: Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator (OSDI '21)