Emulating AVX-512 Intrinsics in Miri: Boosting Rust's SIMD Testing Without Custom Hardware

Rust's rise in systems programming owes much to its blend of memory safety and low-level control, particularly in performance-sensitive areas like data compression. The zlib-rs library exemplifies this, evolving from a 2023 prototype to a mature tool leveraging Intel's AVX-512 instruction set. These 512-bit SIMD vectors promise substantial speedups for algorithms central to file compression, yet integrating them introduces testing challenges in environments without compatible hardware.

Article illustration 1

From Unstable Features to Production Readiness

When zlib-rs began, AVX-512 support in Rust was experimental, and access to capable hardware was scarce. Fast-forward to late 2025: Rust 1.89 has stabilized the necessary target features and intrinsics, aligning with wider availability of AVX-512-enabled processors. This maturity allowed the team to port three key algorithms from the optimized zlib-ng codebase:

  • compare256: Scans for substring matches, a bottleneck in decompression that thrives on wider vectors for parallel comparisons.
  • CRC32: Computes checksums for .gz archives, now processing more data per cycle.
  • Adler32: Handles checksums for standard zlib streams, similarly accelerated.

Idiomatic Rust translations—employing slices and iterators—streamlined these ports, often yielding cleaner code than their C origins. Real-hardware benchmarks validated the performance uplift, but continuous integration remained problematic. GitHub's standard runners cap at AVX2, and provisioning custom AVX-512 machines for CI seemed inefficient.

Emulation as the Path Forward

The solution lay in emulation, focusing on behavioral fidelity rather than raw speed for CI validation. QEMU, already used for cross-target testing (e.g., s390x), was a logical starting point. Yet its TCG mode rejects AVX-512 features outright, flagging unsupported extensions like avx512f, avx512bw, and vpclmulqdq. Emulating the full AVX-512 suite—hundreds of instructions—would be a herculean effort, unfit for a targeted need.

Intel's emulator, employed in rust-lang/stdarch tests, supports AVX-512 but at a glacial pace, unsuitable for routine CI. Enter Miri, Rust's dynamic analysis tool for detecting undefined behavior. With existing AVX2 coverage, Miri offered an extensible foundation. The zlib-rs team proposed contributing just the intrinsics required, capitalizing on shared logic with narrower AVX2 implementations.

Extending Miri for Targeted AVX-512 Support

The contributions added emulation for four intrinsics, plus leveraging existing support for vpclmulqdq (carry-less multiplication). Each builds on AVX2 patterns, minimizing new code:

  • _mm512_sad_epu8: Sums absolute differences across 64-byte vectors, treating them as 8x8 u8 matrices. Vital for compare256's distance calculations.
  • _mm512_ternarylogic_epi32: Performs bit-wise operations on three 32-bit vectors, indexing an 8-bit mask per column for functions like AND or XOR.
  • _mm512_maddubs_epi16: Merges unsigned 8-bit multiplications with additions, optimizing checksum accumulation.
  • _mm512_permutexvar_epi32: Permutes 32-bit elements based on runtime indices, aiding dynamic data shuffling.

These live in Miri's src/shims/x86/avx512.rs, using generic helpers for vector-width agnosticism. Testing emphasized edge cases, with hardware cross-checks and enhancements to stdarch's suite—particularly for _mm512_ternarylogic_epi32, where prior tests fell short.

Dissecting the _mm512_sad_epu8 Implementation

Take the sum-of-absolute-differences intrinsic, via rust-lang/miri#4686. The shim intercepts the LLVM moniker (psad.bw.512), enforces the avx512bw feature, and invokes a helper:

"psad.bw.512" => {
    this.expect_target_feature_for_intrinsic(link_name, "avx512bw")?;

    let [left, right] =
        this.check_shim_sig_lenient(abi, CanonAbi::C, link_name, args)?;

    psadbw(this, left, right, dest)?
}

The psadbw helper projects operands to SIMD arrays, verifies dimensions (e.g., u8x64 inputs yielding u64x8 outputs), and iterates:

for i in 0..dest_len {
    let dest = ecx.project_index(&dest, i)?;

    let mut acc: u16 = 0;
    for j in 0..8 {
        let src_index = i.strict_mul(8).strict_add(j);

        let left = ecx.project_index(&left, src_index)?;
        let left = ecx.read_scalar(&left)?.to_u8()?;

        let right = ecx.project_index(&right, src_index)?;
        let right = ecx.read_scalar(&right)?.to_u8()?;

        acc = acc.strict_add(left.abs_diff(right).into());
    }

    ecx.write_scalar(Scalar::from_u64(acc.into()), &dest)?;
}

Miri's strict_add and explicit casts trap overflows, upholding Rust's invariants. Development involved iterative refinement, guided by Intel's intrinsics docs, to match hardware semantics precisely.

Implications for Rust's SIMD Ecosystem

This work resolves zlib-rs's CI constraints while enriching Miri for the community. Projects tackling AVX-512 can now test rigorously on commodity hardware, democratizing access to SIMD optimizations. The changes ship in zlib-rs 0.5.3 and libz-rs-sys, gated behind Rust 1.89+ and flags like -Ctarget-feature=+avx512vl,+avx512bw—easily toggled with -Ctarget-cpu=native.

Beyond compression, this precedent signals Rust's adaptability: targeted tool enhancements outpace broad emulator overhauls, empowering developers in fields from cryptography to machine learning. As SIMD architectures advance, such contributions ensure Rust remains at the vanguard of safe, performant code.

This article is based on 'Emulating AVX-512 Intrinsics in Miri' by Folkert de Vries, published by Trifecta Tech Foundation on December 9, 2025. Source: trifectatech.org/blog/emulating-avx-512-intrinsics-in-miri/.