When RocksDB's Randomness Test Uncovered a CPU Hardware Bug

A seemingly innocuous unit test in RocksDB's codebase led to the discovery of a serious hardware bug in newer AMD CPUs, demonstrating how software testing can reveal fundamental hardware flaws.

For four years the test passed without incident. Then it failed twice within two months, and the investigation that followed traced the failures to the random number hardware in a newer AMD processor.

The Quest for Reliable Unique Identifiers

Four years ago, RocksDB developers faced a common engineering dilemma: whether to rely on existing solutions or build their own. The team needed stable, unique identifiers for SST files that would work across different filesystems. Operating systems do provide file identifiers, but some filesystems guarantee uniqueness only among files that currently exist, not across all files in recent history, so an identifier can be reused after a file is deleted.

The team decided to implement their own system using random identifiers, specifically 128-bit quasi-random numbers. This approach offered predictability and eliminated dependence on potentially inconsistent filesystem behavior. However, this solution required access to high-quality random numbers, which presented its own challenges.

Balancing Dependencies and Self-Reliance

As a cross-platform database, RocksDB needed to minimize platform-specific dependencies while ensuring reliability. The team combined multiple entropy sources to generate random identifiers:

C++11's std::random_device (intended to provide high-quality randomness, but not guaranteed to do so)
Hashes of environment parameters like hostname, process ID, thread ID, and timestamps
Platform-specific UUID generators (Linux and Windows only)

This multi-source approach meant that even if one entropy source failed, the others could compensate. Since RocksDB only needed uniqueness rather than cryptographic security, the team could use a quasi-random approach that minimized entropy requirements.
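The idea of letting sources compensate for one another can be sketched in Python. This is only an illustration, not RocksDB's actual C++ scheme: the function names (environment_bytes, combine, generate_id, broken_source) are hypothetical, os.urandom stands in for std::random_device, and uuid.uuid4 stands in for the platform UUID generators. It also hashes the sources together rather than reproducing the quasi-random construction the team used.

```python
import hashlib
import os
import socket
import threading
import time
import uuid


def environment_bytes() -> bytes:
    """Weak but useful entropy: hostname, process ID, thread ID, timestamps."""
    parts = (
        socket.gethostname(),
        os.getpid(),
        threading.get_ident(),
        time.time_ns(),
        time.perf_counter_ns(),
    )
    return repr(parts).encode()


def combine(*chunks: bytes) -> int:
    """Hash every chunk into a single 128-bit integer."""
    h = hashlib.blake2b(digest_size=16)
    for chunk in chunks:
        # Length-prefix each chunk so chunk boundaries stay unambiguous.
        h.update(len(chunk).to_bytes(4, "big"))
        h.update(chunk)
    return int.from_bytes(h.digest(), "big")


def generate_id(random_source=os.urandom) -> int:
    """Mix three independent sources (all stand-ins, see lead-in)."""
    return combine(random_source(16), environment_bytes(), uuid.uuid4().bytes)


def broken_source(n: int) -> bytes:
    """Simulates a failed entropy source that always returns zeros."""
    return bytes(n)


# Even with one source stuck at zero, the other sources keep the IDs distinct.
ids = {generate_id(broken_source) for _ in range(1000)}
print(len(ids), "distinct IDs out of 1000")
```

Because every source feeds the same hash, a single stuck source reduces the available entropy but does not by itself produce duplicates.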

Building in Trust But Verify

To ensure ongoing reliability, the team added unit tests that created thousands of unique identifiers using each entropy source individually. These tests ran continuously with multiple threads, checking for duplicates. For high-quality sources, the probability of any duplicate 128-bit IDs among thousands was negligible—even over decades of continuous testing.
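The "negligible" claim can be checked with the standard birthday bound: among n uniformly random b-bit values, the probability of at least one duplicate is at most n(n-1)/2 divided by 2^b. The sketch below assumes 10,000 IDs per test run and one run per second for 30 years. Both figures are illustrative assumptions, not numbers from the RocksDB tests.

```python
from fractions import Fraction


def collision_probability(n: int, bits: int = 128) -> Fraction:
    """Upper bound (union bound over all pairs) on the chance of any duplicate."""
    pairs = Fraction(n * (n - 1), 2)
    return pairs / 2**bits


# Assumed workload: 10,000 IDs per run (hypothetical figure).
per_run = collision_probability(10_000)

# Assumed schedule: one run per second for 30 years (hypothetical figure).
runs = 30 * 365 * 24 * 3600
over_decades = per_run * runs

print(f"per run:       {float(per_run):.3e}")
print(f"over 30 years: {float(over_decades):.3e}")
```

Under these assumptions, a single observed duplicate is far better explained by a faulty source than by bad luck.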

For four years, these tests ran without incident. Then something strange happened.

When Randomness Stopped Being Random

About a year ago, the test based on std::random_device failed once. This was suspicious because the number of unique IDs fell short by dozens or hundreds, not by just one, meaning many IDs had been duplicated in a single run. A transient CPU hiccup could conceivably explain one such shortfall, but the pattern was concerning.

Then, a month later, the same test failed again. Two failures in two months after four years of perfect operation. The team noticed a crucial correlation: both failed test jobs ran on the same type of newer CPU, though in completely different data centers.

Scaling Up to Find the Bug

The team scaled up the test, increasing thread counts to match core counts. The test now failed quickly and consistently on all systems using the same newer CPU type, while passing on everything else. Further testing revealed that std::random_device was not affected when it used the "rdrand" or "/dev/urandom" sources, and that only libstdc++ (from GCC) was affected, not libc++ (from clang).
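A scaled-up duplicate check of this kind can be sketched as follows. This is a simplified Python illustration rather than the RocksDB unit test: count_duplicates and stuck_source are hypothetical names, and os.urandom stands in for the source under test.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_duplicates(source, threads: int, ids_per_thread: int) -> int:
    """Draw IDs from `source` on several threads and count repeated values."""

    def worker(_):
        return [source() for _ in range(ids_per_thread)]

    with ThreadPoolExecutor(max_workers=threads) as pool:
        batches = list(pool.map(worker, range(threads)))

    all_ids = [value for batch in batches for value in batch]
    return len(all_ids) - len(set(all_ids))


def stuck_source() -> bytes:
    """Simulates a faulty source that always returns a zero ID."""
    return bytes(16)


# Match the thread count to the core count, as the scaled-up test did.
threads = os.cpu_count() or 1

healthy = count_duplicates(lambda: os.urandom(16), threads, 1000)
faulty = count_duplicates(stuck_source, threads, 1000)
print("healthy source duplicates:", healthy)
print("faulty source duplicates: ", faulty)
```

Counting the shortfall, rather than only reporting pass or fail, preserves the detail that made the first failure suspicious: how many IDs were missing.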

Root Cause: A Hardware-Level Failure

Meta's engineering team investigated the low-level details and discovered the problem: the RDSEED instruction on this processor type would return 0 and "success" much more often than random chance would predict, but only on some cores and only under "complex micro-architectural conditions reproducible under memory-load."
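To see why a zero returned with "success" is so telling, consider how often a healthy source should produce it. The sketch below is a pure simulation and does not execute RDSEED: simulated_rdseed is a hypothetical stand-in, the 64-bit output width is an assumption, and the 1% fault rate is an arbitrary illustrative value, not a measured one.

```python
import random


def simulated_rdseed(fault_rate: float, rng: random.Random, bits: int = 64):
    """Hypothetical stand-in for the instruction: returns (success, value)."""
    if rng.random() < fault_rate:
        return True, 0  # the faulty case: reports success, value is 0
    return True, rng.getrandbits(bits)


def count_zero_successes(draws: int, fault_rate: float, seed: int = 0) -> int:
    """Count draws that report success but deliver the value 0."""
    rng = random.Random(seed)
    zeros = 0
    for _ in range(draws):
        ok, value = simulated_rdseed(fault_rate, rng)
        if ok and value == 0:
            zeros += 1
    return zeros


draws = 100_000
# A healthy 64-bit source yields 0 with probability 2**-64 per draw.
expected_if_healthy = draws / 2**64

print("expected zeros if healthy:", expected_if_healthy)
print("observed, healthy model:  ", count_zero_successes(draws, 0.0))
print("observed, faulty model:   ", count_zero_successes(draws, 0.01))
```

With an expected count this close to zero, even a handful of zero results in a test run is strong evidence of a fault rather than chance.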

This wasn't a software bug—it was a fundamental hardware flaw affecting the processor's ability to generate random numbers correctly.

The Aftermath and Broader Implications

The discovery was serious enough to warrant a "high severity" CVE assignment. A Linux kernel patch was developed to signal that RDSEED was unavailable on these processors, with Meta planning internal rollout while waiting for an OEM fix. AMD quickly acknowledged the issue and announced plans for a CPU microcode update.

However, the incident also highlighted coordination challenges. The issue was disclosed on the Linux mailing list without coordination with the OEM, a result of zealous remediation efforts across multiple infrastructure teams at Meta. The team acknowledged this mistake and is working to improve controls on the processes that failed to coordinate with the OEM first.

Key Lessons for Software Engineering

This incident offers several valuable takeaways for software engineers:

Test what you depend on: The unit test that revealed this bug was originally added to verify the quality of random number sources. Without these tests, the hardware bug might have gone undetected for much longer.

Build in redundancies and sanity checks: By combining multiple entropy sources and testing each independently, RocksDB was able to detect when one source failed. This approach caught a hardware bug that might have otherwise caused subtle data corruption.

Even CPUs can have bugs: Most CPU problems come down to individual defective units, but this incident shows that a flaw can be present across an entire processor type. The investment in redundant checks and ongoing verification paid off by catching the problem before it could cause more serious issues.

The balance between reuse and self-reliance: The original decision to implement custom unique identifiers, rather than relying solely on filesystem-provided IDs, created the conditions that eventually revealed this hardware bug. Sometimes building your own solution, even when alternatives are available, can provide valuable independence and early warning of systemic issues.

This story demonstrates how thorough testing, thoughtful architecture decisions, and persistent investigation can uncover problems that span the hardware-software boundary—problems that might otherwise remain hidden until they cause more serious failures.
