Daniel Lemire explores how ARM's Scalable Vector Extensions (SVE/SVE2) enable significantly faster character matching than traditional NEON instructions, delivering roughly a 25% performance gain while cutting instruction count by about 30%.
Daniel Lemire presents a compelling examination of character matching techniques on ARM processors, focusing on the transition from traditional NEON instructions to the newer Scalable Vector Extensions (SVE/SVE2). The optimization is more than an incremental improvement: it illustrates how evolving hardware capabilities can reshape our approach to fundamental string operations.
The core problem addressed is vectorized classification—identifying ASCII whitespace characters and JSON structural characters within text. This operation forms a critical subproblem in high-performance JSON parsing libraries like simdjson and appears in various text processing scenarios including DNS record parsing. The challenge lies in efficiently processing these character classifications across large datasets while minimizing computational overhead.
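As a point of reference, the classification in question is trivial to express as a scalar function; the vectorized techniques discussed below compute the same predicate for many bytes at once. This sketch follows the article's description of the character set (ASCII whitespace plus JSON structural characters); the function name is illustrative.

```c
#include <stdbool.h>

/* Reference scalar classifier: is c an ASCII whitespace character
   or a JSON structural character? The set (space, tab, LF, CR and
   the separators/brackets) matches the classes described above. */
static bool is_ws_or_structural(unsigned char c) {
    switch (c) {
    case ' ': case '\t': case '\n': case '\r':  /* ASCII whitespace */
    case ',': case ':':                         /* JSON separators  */
    case '[': case ']': case '{': case '}':     /* JSON brackets    */
        return true;
    default:
        return false;
    }
}
```

A SIMD implementation must produce this yes/no answer for 16 (or 64) bytes per step, which is where the techniques below come in.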
Traditional approaches using NEON instructions have relied on a table-driven, branch-free classifier that splits each byte into low and high nibbles, performs SIMD table lookups, and combines results through bitwise operations. This method, while effective, requires multiple equality comparisons per character and becomes increasingly inefficient as data volumes grow.
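The nibble-splitting idea can be sketched in portable scalar C; the NEON version performs the same two lookups with TBL instructions across 16 bytes at a time. The table-construction scheme below, which assigns one bit per distinct high nibble, is one way to build exact tables for a small character set. It is a sketch for illustration, not simdjson's actual tables.

```c
#include <stdint.h>
#include <string.h>

/* Table-driven nibble classifier: a byte matches when the lookups on
   its low and high nibbles share a bit. Bits are keyed by high nibble,
   which is exact for sets with at most 8 distinct high nibbles. */
static uint8_t table_lo[16], table_hi[16];

static void build_tables(const char *set, size_t n) {
    uint8_t bit_for_hi[16] = {0};
    uint8_t next_bit = 1;
    memset(table_lo, 0, sizeof table_lo);
    memset(table_hi, 0, sizeof table_hi);
    for (size_t i = 0; i < n; i++) {
        uint8_t c = (uint8_t)set[i], hi = c >> 4, lo = c & 0xF;
        if (!bit_for_hi[hi]) { bit_for_hi[hi] = next_bit; next_bit <<= 1; }
        table_hi[hi] |= bit_for_hi[hi];
        table_lo[lo] |= bit_for_hi[hi];
    }
}

/* Two table lookups and an AND, the scalar analogue of the NEON
   TBL-based classification described above. */
static int classify(uint8_t c) {
    return (table_lo[c & 0xF] & table_hi[c >> 4]) != 0;
}
```

With NEON, the two lookups run on all 16 lanes of a vector at once, but each 16-byte chunk still costs several instructions (shifts, two table lookups, an AND, and mask extraction), which is the overhead the SVE2 approach removes.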
The introduction of SVE/SVE2 marks a significant architectural shift. Unlike NEON's fixed-length SIMD registers, SVE/SVE2 employs scalable registers whose width is left to the chip maker rather than fixed by the architecture. In practice, however, commodity chips have standardized on 128-bit registers containing 16 bytes. For this problem, the true innovation lies in the match and nmatch instructions, which test set membership for every byte in a vector against a predefined lookup set simultaneously. A single such instruction replaces what would otherwise require numerous equality comparisons and OR-reductions.
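The semantics of the match instruction can be modeled in scalar code: one predicate bit per byte of the input vector, set when that byte occurs anywhere in a 16-byte set vector. The sketch below uses illustrative names; the real instruction performs all of this work in a single operation.

```c
#include <stdint.h>

/* Scalar model of SVE2 match semantics on 128-bit vectors: for each
   byte of a 16-byte chunk, test membership in a 16-byte set vector
   (duplicates allowed as padding) and set the corresponding predicate
   bit. One instruction does this on real SVE2 hardware. */
static uint16_t match16(const uint8_t chunk[16], const uint8_t set[16]) {
    uint16_t pred = 0;
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            if (chunk[i] == set[j]) {
                pred |= (uint16_t)(1u << i);
                break;
            }
    return pred;
}
```

Expressed with equality comparisons alone, the same result would take one compare per set element plus OR-reductions per chunk, which is precisely the instruction-count overhead the article measures.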
Lemire's implementation demonstrates elegant simplicity using SVE2 intrinsics. By loading character sets into SIMD registers and applying match instructions, the code produces predicates indicating character matches. These predicates can then be materialized as byte vectors and converted to scalar bitmasks. The approach handles the 64-byte processing blocks required by simdjson through four 16-byte chunks, maintaining compatibility with existing NEON operations for final mask generation.
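The 64-byte block handling described above can be modeled portably: classify four 16-byte chunks and pack the per-byte results into a single 64-bit mask. In the real implementation each chunk's bits would come from an SVE2 match predicate; the simple membership test below merely stands in for it, and the helper name is illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Portable model of the 64-byte block step: classify four 16-byte
   chunks and pack the results into one 64-bit scalar bitmask, the
   shape simdjson's first stage consumes. The memchr membership test
   stands in for the per-chunk SVE2 match predicate. */
static uint64_t block_mask64(const uint8_t block[64],
                             const char *set, size_t set_len) {
    uint64_t mask = 0;
    for (int chunk = 0; chunk < 4; chunk++) {
        uint64_t m16 = 0;
        for (int i = 0; i < 16; i++) {
            uint8_t c = block[chunk * 16 + i];
            if (memchr(set, c, set_len) != NULL)
                m16 |= (uint64_t)1 << i;  /* one bit per matching byte */
        }
        mask |= m16 << (chunk * 16);      /* pack chunks into 64 bits */
    }
    return mask;
}
```

Bit i of the result corresponds to byte i of the block, so downstream bit-manipulation code can locate structural characters with count-trailing-zeros operations regardless of whether the mask was produced by NEON or SVE2.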
Benchmark results reveal substantial performance improvements. On an AWS Graviton 4 processor (Neoverse V2), the SVE/SVE2 implementation achieves 14.4 GB/s compared to 11.4 GB/s for the NEON equivalent—a gain of roughly 25%. More tellingly, the new approach executes about 30% fewer instructions at a comparable instructions-per-cycle rate (3.8 vs. 3.5): the speedup comes from doing less work per byte rather than from higher per-cycle throughput.
These improvements have significant implications for text processing in cloud environments where ARM processors like Amazon's Graviton line, Microsoft's Cobalt, and Google's Axion are increasingly dominant. The simdjson library, already renowned for its JSON parsing performance, could benefit substantially from these optimizations, potentially enabling even faster data processing in web services and API endpoints.
The article also highlights an interesting architectural tension in processor design. While SVE/SVE2's variable-length register approach offers theoretical flexibility, its implementation has largely converged to fixed 128-bit registers. This convergence suggests that while architectural innovation continues, practical implementation often follows established patterns. The interoperability between SVE/SVE2 and NEON further enables gradual adoption without requiring complete code rewrites.
Several questions remain open. Apple's absence from SVE/SVE2 adoption creates an interesting divergence in the ARM ecosystem, potentially limiting optimization opportunities for iOS and macOS applications. Additionally, the article focuses exclusively on character matching; extending these techniques to more complex parsing operations presents both opportunities and challenges.
As we look to the future, this work exemplifies how hardware-software co-design continues to push performance boundaries. The match and nmatch instructions may indeed represent the fastest practical approach to character matching on ARM processors, but they also point toward a broader paradigm shift in how we think about vector operations and data processing. The evolution from NEON to SVE/SVE2 demonstrates that even fundamental operations continue to benefit from architectural innovation, with significant implications for performance-critical applications across the computing spectrum.
For those interested in implementing these optimizations, the benchmark code is available on GitHub, and the full research can be found in Lemire's related publications on JSON parsing and DNS record processing.
