StringZilla Unleashed: AVX‑512‑Powered Unicode Case‑Insensitive Search
Share this article
StringZilla Unleashed: AVX‑512‑Powered Unicode Case‑Insensitive Search
The world’s most widely used text‑encoding standard, UTF‑8, has grown from a niche solution in 1989 to the de‑facto backbone of the web, now covering more than a million codepoints. Its variable‑length nature, however, makes high‑performance processing a moving target: a single character can occupy one to four bytes, and many characters expand into multiple codepoints when folded for case‑insensitive comparisons.
The Challenge of Unicode‑Aware Search
Traditional libraries such as ICU provide robust case‑folding, but they lack a dedicated, fast, case‑insensitive substring search API. Developers typically resort to a two‑step pipeline: fold the haystack and needle, then run a naive memmem or regular‑expression search. This approach is not only slow—often limited by the folding stage—but also error‑prone, as offsets can shift when expansions occur.
StringZilla’s Design Goals
- Correctness – Full compliance with Unicode 17’s 1,400+ case‑folding rules, including multi‑character expansions and locale‑agnostic mappings.
- Speed – Exploit AVX‑512 on Intel and AMD CPUs to accelerate tokenization, case‑folding, and substring search.
- Portability – Provide language bindings for C/C++, Rust, Python, Swift, Node.js, Go, and more, all sharing the same kernel implementations.
- Extensibility – A modular architecture that allows new script kernels (e.g., Georgian, Korean) to be added without rewriting core logic.
How It Works
1. Safe Window Selection
Instead of folding the entire needle, StringZilla scans for a safe 16‑byte window that is guaranteed to be fold‑stable. The algorithm walks the needle by UTF‑8 codepoints, folds forward, and stops when the window reaches 16 bytes or encounters an alarm—a character that would shrink or expand under folding. By focusing on this window, the library sidesteps the expensive full‑fold path for most needles.
2. Script‑Aware SIMD Kernels
The library partitions Unicode into script‑ish buckets (ASCII, Western European, Cyrillic, Greek, etc.) and assigns each a dedicated AVX‑512 kernel. These kernels use a mix of ternary logic, byte‑table shuffles, and range checks to compare the haystack against the probe bytes. For example, the Cyrillic kernel uses a single VPSHUFB lookup to map the high nibble of the second UTF‑8 byte to the required offset, avoiding branchy logic.
3. Verification and Fallbacks
Every SIMD hit is followed by a lightweight verifier that checks the preceding and following bytes to ensure a true match. If a probe triggers an alarm or the SIMD path fails, the library falls back to a serial, Unicode‑correct algorithm—either a hash‑free scan for short needles or a Rabin‑Karp style rolling hash for longer ones.
Benchmarks
The authors evaluated StringZilla against ICU, PCRE2, and the Python regex module using the Leipzig Wikipedia corpora (≈ 100 MB per language). The results, reproduced from the library’s own benchmark suite, show:
| Language | StringZilla (GB/s) | ICU (GB/s) | PCRE2 (GB/s) |
|---|---|---|---|
| English | 10.9× | 0.081 | 1.42 |
| German | 13.6× | 0.081 | 1.24 |
| Russian | 10.6× | 0.147 | 0.257 |
| Arabic | 13.7× | 0.193 | 0.253 |
Across the board, StringZilla delivers 50–150× speedups over the serial ICU pipeline and up to 20–150× over PCRE2 for case‑insensitive searches. The library also achieves 5–15 GB/s throughput on cached data, approaching the limits of NVMe SSD read speeds.
Usage Across Languages
// C++ example
namespace sz = ashvardanian::stringzilla;
auto [offset, len] = sz::string_view("Der große Hund").utf8_case_insensitive_find("GROSSE");
# Python example
import stringzilla as sz
print(sz.utf8_case_insensitive_find("Der große Hund", "GROSSE")) # 4
// Rust example
use stringzilla::stringzilla::{utf8_case_insensitive_find, Utf8CaseInsensitiveNeedle};
let needle = Utf8CaseInsensitiveNeedle::new(b"STRASSE");
assert_eq!(utf8_case_insensitive_find("Straße", &needle), Some((0, 7)));
The library ships as a pre‑compiled wheel for Python, a static header‑only library for C/C++, and a Cargo crate for Rust, making it easy to drop into existing codebases.
Future Directions
- ARM Support – Porting the kernels to NEON/SVE will broaden the library’s reach to mobile and embedded devices.
- New Script Kernels – Georgian, Korean, and other high‑frequency scripts are on the roadmap.
- GPU Acceleration – A parallel extension for NVIDIA GPUs is available under the
stringzilla-cudapackage.
Conclusion
StringZilla demonstrates that the bottleneck in Unicode‑aware text processing is not the encoding itself but the lack of hardware‑accelerated, spec‑compliant algorithms. By marrying a meticulous treatment of Unicode case‑folding with AVX‑512 SIMD kernels, the library offers developers a drop‑in replacement that is both faster and more correct than existing solutions.
Source: https://ashvardanian.com/posts/search-utf8/