A new open‑source library, StringZilla, delivers up to 150× speedups over ICU for Unicode case‑insensitive substring search by leveraging AVX‑512 kernels and a rigorous, spec‑compliant folding engine. The article explains how the library navigates the complexities of UTF‑8, showcases performance benchmarks across languages, and outlines future extensions for ARM and other scripts.

StringZilla Unleashed: AVX‑512‑Powered Unicode Case‑Insensitive Search

The world’s most widely used text‑encoding standard, UTF‑8, has grown from a niche solution in 1989 to the de‑facto backbone of the web, now covering more than a million codepoints. Its variable‑length nature, however, makes high‑performance processing a moving target: a single character can occupy one to four bytes, and many characters expand into multiple codepoints when folded for case‑insensitive comparisons.

The Challenge of Unicode‑Aware Search

Traditional libraries such as ICU provide robust case‑folding, but they lack a dedicated, fast, case‑insensitive substring search API. Developers typically resort to a two‑step pipeline: fold the haystack and needle, then run a naive memmem or regular‑expression search. This approach is not only slow—often limited by the folding stage—but also error‑prone, as offsets can shift when expansions occur.

StringZilla’s Design Goals

Correctness – Full compliance with Unicode 17’s 1,400+ case‑folding rules, including multi‑character expansions and locale‑agnostic mappings.
Speed – Exploit AVX‑512 on Intel and AMD CPUs to accelerate tokenization, case‑folding, and substring search.
Portability – Provide language bindings for C/C++, Rust, Python, Swift, Node.js, Go, and more, all sharing the same kernel implementations.
Extensibility – A modular architecture that allows new script kernels (e.g., Georgian, Korean) to be added without rewriting core logic.

How It Works

1. Safe Window Selection

Instead of folding the entire needle, StringZilla scans for a safe 16‑byte window that is guaranteed to be fold‑stable. The algorithm walks the needle by UTF‑8 codepoints, folds forward, and stops when the window reaches 16 bytes or encounters an alarm—a character that would shrink or expand under folding. By focusing on this window, the library sidesteps the expensive full‑fold path for most needles.

2. Script‑Aware SIMD Kernels

The library partitions Unicode into script‑ish buckets (ASCII, Western European, Cyrillic, Greek, etc.) and assigns each a dedicated AVX‑512 kernel. These kernels use a mix of ternary logic, byte‑table shuffles, and range checks to compare the haystack against the probe bytes. For example, the Cyrillic kernel uses a single VPSHUFB lookup to map the high nibble of the second UTF‑8 byte to the required offset, avoiding branchy logic.

3. Verification and Fallbacks

Every SIMD hit is followed by a lightweight verifier that checks the preceding and following bytes to ensure a true match. If a probe triggers an alarm or the SIMD path fails, the library falls back to a serial, Unicode‑correct algorithm—either a hash‑free scan for short needles or a Rabin‑Karp style rolling hash for longer ones.

Benchmarks

The authors evaluated StringZilla against ICU, PCRE2, and the Python regex module using the Leipzig Wikipedia corpora (≈ 100 MB per language). The results, reproduced from the library’s own benchmark suite, show:

Language	StringZilla (GB/s)	ICU (GB/s)	PCRE2 (GB/s)
English	10.9×	0.081	1.42
German	13.6×	0.081	1.24
Russian	10.6×	0.147	0.257
Arabic	13.7×	0.193	0.253

Across the board, StringZilla delivers 50–150× speedups over the serial ICU pipeline and up to 20–150× over PCRE2 for case‑insensitive searches. The library also achieves 5–15 GB/s throughput on cached data, approaching the limits of NVMe SSD read speeds.

Usage Across Languages

// C++ example
namespace sz = ashvardanian::stringzilla;
auto [offset, len] = sz::string_view("Der große Hund").utf8_case_insensitive_find("GROSSE");

# Python example
import stringzilla as sz
print(sz.utf8_case_insensitive_find("Der große Hund", "GROSSE"))  # 4

// Rust example
use stringzilla::stringzilla::{utf8_case_insensitive_find, Utf8CaseInsensitiveNeedle};
let needle = Utf8CaseInsensitiveNeedle::new(b"STRASSE");
assert_eq!(utf8_case_insensitive_find("Straße", &needle), Some((0, 7)));

The library ships as a pre‑compiled wheel for Python, a static header‑only library for C/C++, and a Cargo crate for Rust, making it easy to drop into existing codebases.

Future Directions

ARM Support – Porting the kernels to NEON/SVE will broaden the library’s reach to mobile and embedded devices.
New Script Kernels – Georgian, Korean, and other high‑frequency scripts are on the roadmap.
GPU Acceleration – A parallel extension for NVIDIA GPUs is available under the stringzilla-cuda package.

Conclusion

StringZilla demonstrates that the bottleneck in Unicode‑aware text processing is not the encoding itself but the lack of hardware‑accelerated, spec‑compliant algorithms. By marrying a meticulous treatment of Unicode case‑folding with AVX‑512 SIMD kernels, the library offers developers a drop‑in replacement that is both faster and more correct than existing solutions.

Source: https://ashvardanian.com/posts/search-utf8/