Efficiently Sorting Hierarchical Data with String Interning
When serializing hierarchical data to JSON, developers often face a common challenge: input records may contain interleaved substructures that must become contiguous in the output. Traditional lexicographical sorting would solve grouping but disrupt the original field order. A recent solution by Chris Wellons introduces token-based namespaces and string interning to balance both requirements efficiently.
The Core Problem
Consider sensor data fields like timestamp, point.x, point.y, foo.bar.z, point.z, foo.bar.y, and foo.bar.x. Input records store values in this interleaved order, but JSON demands contiguous substructures:
{
  "timestamp": 1758158348,
  "point": {"x": 1.23, "y": 4.56, "z": 7.89},
  "foo": {"bar": {"z": -100, "y": -200, "x": -300}}
}
Simply sorting field names alphabetically would group the substructures, but it would also move timestamp to the end and reorder the z, y, x members of foo.bar. The solution? Hierarchical tokenization with namespaces.
Token-Based Namespaces
- String Interning: Assign unique monotonic tokens to each path segment (e.g., point → token 2).
- Namespace Binding: Tokens inherit context from parent structures using a (parent_token, segment) → child_token map:
{0, "point"} → 2
{2, "x"} → 3
{0, "foo"} → 5
{5, "bar"} → 6
- Path Resolution: Convert foo.bar.x into the token sequence [5,6,10], preserving hierarchy without redundant string storage.
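The three steps above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names Table, Entry, intern, and resolve are hypothetical, the fixed-size array with a linear scan is a stand-in for the hash trie described later, and segment pointers are assumed to outlive the table.

```c
#include <assert.h>
#include <string.h>

typedef int Token;

enum { MAX_ENTRIES = 64, MAX_DEPTH = 8 };

// One interned (namespace, segment) pair and the token assigned to it.
typedef struct {
    Token       ns;     // parent token; 0 is the root namespace
    const char *seg;    // start of the segment text (not null-terminated)
    size_t      len;    // segment length in bytes
    Token       token;  // token assigned to this pair
} Entry;

typedef struct {
    Entry entries[MAX_ENTRIES];
    int   count;
} Table;

// Return the token for (ns, seg), assigning the next unused token the
// first time the pair is seen. Linear scan: a stand-in for a real map.
static Token intern(Table *t, Token ns, const char *seg, size_t len)
{
    for (int i = 0; i < t->count; i++) {
        Entry *e = &t->entries[i];
        if (e->ns == ns && e->len == len && !memcmp(e->seg, seg, len)) {
            return e->token;
        }
    }
    assert(t->count < MAX_ENTRIES);
    Token token = t->count + 1;  // tokens start at 1; 0 is the root
    t->entries[t->count++] = (Entry){ns, seg, len, token};
    return token;
}

// Resolve a dotted path into a token sequence, descending one namespace
// per segment. Returns the number of tokens written to out.
static int resolve(Table *t, const char *path, Token out[MAX_DEPTH])
{
    int   n  = 0;
    Token ns = 0;
    for (;;) {
        size_t len = strcspn(path, ".");
        assert(n < MAX_DEPTH);
        ns = intern(t, ns, path, len);
        out[n++] = ns;
        if (!path[len]) {
            return n;
        }
        path += len + 1;  // skip past the dot
    }
}
```

Resolving the seven example fields in input order with this sketch assigns timestamp → 1, point → 2, and so on, and foo.bar.x resolves to [5,6,10], matching the map shown above.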
Algorithmic Advantages
- O(n log n) Complexity: Token sequences sort lexically to group substructures:
point.x → [2,3]
point.y → [2,4]
point.z → [2,8] // Now contiguous!
- Quadratic Avoidance: Early approaches using full path prefixes ("foo.bar.x") risked O(n²) cost. Namespaces reduce each segment to a single lookup keyed by (parent token, segment), so no prefix is ever rebuilt or re-compared.
- Order Preservation: Tokens are assigned in order of first appearance, so top-level fields like timestamp retain their position unless substructure grouping necessitates reordering.
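To make the grouping concrete, here is a minimal sketch of a lexicographic sort over token sequences using the standard qsort. The Field layout, field_cmp, and sort_fields are assumptions for illustration, not the original implementation. qsort is not a stable sort, but distinct fields have distinct token sequences, so ties do not arise.

```c
#include <assert.h>
#include <stdlib.h>

typedef int Token;

enum { MAX_DEPTH = 8 };

typedef struct {
    const char *name;               // original dotted field name
    Token       tokens[MAX_DEPTH];  // interned path, one token per segment
    int         len;                // number of tokens in use
} Field;

// Lexicographic comparison of token sequences. When one sequence is a
// strict prefix of the other, the shorter one sorts first.
static int field_cmp(const void *pa, const void *pb)
{
    const Field *a = pa;
    const Field *b = pb;
    int n = a->len < b->len ? a->len : b->len;
    for (int i = 0; i < n; i++) {
        if (a->tokens[i] != b->tokens[i]) {
            return a->tokens[i] < b->tokens[i] ? -1 : 1;
        }
    }
    return (a->len > b->len) - (a->len < b->len);
}

// Sort fields so that entries sharing a parent token become adjacent.
static void sort_fields(Field *fields, size_t count)
{
    assert(fields || count == 0);
    qsort(fields, count, sizeof *fields, field_cmp);
}
```

Applied to the example tokens, the sorted order is timestamp [1], point.x [2,3], point.y [2,4], point.z [2,8], foo.bar.z [5,6,7], foo.bar.y [5,6,9], foo.bar.x [5,6,10], which is the order of the JSON output shown earlier.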
C Implementation Highlights
A hash-trie maps (token, string) pairs to tokens, leveraging arena allocation for zero-cleanup memory management:
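typedef struct Map Map;
struct Map {
    Map *child[4];    // four-way branching on hash bits
    Token namespace;  // parent token (0 for the root)
    Str segment;      // path segment text
    Token token;      // token assigned to (namespace, segment)
};
Token *upsert(Map **m, Token namespace, Str segment, Arena *a) {
    // ... Hash trie insertion logic
}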
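The insertion logic is elided above. The following is one plausible sketch of a hash-trie upsert for this node layout, not the author's code. It makes several assumptions: calloc replaces the arena (so the Arena parameter is dropped), the hash is an FNV-1a-style function seeded with the namespace, the Str layout is guessed, and intern is a hypothetical helper that treats token 0 as "not yet assigned".

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef int Token;

typedef struct {
    const char *data;
    size_t      len;
} Str;  // assumed layout: pointer plus length

typedef struct Map Map;
struct Map {
    Map  *child[4];
    Token namespace;
    Str   segment;
    Token token;
};

// FNV-1a-style hash over the segment bytes, seeded with the namespace so
// the same segment under different parents hashes differently.
static uint64_t hash(Token namespace, Str s)
{
    uint64_t h = 0xcbf29ce484222325u ^ (uint64_t)namespace;
    for (size_t i = 0; i < s.len; i++) {
        h ^= (unsigned char)s.data[i];
        h *= 0x100000001b3u;
    }
    return h;
}

// Find the node for (namespace, segment), creating it with token 0 if it
// is absent. Each level consumes the top two bits of the hash.
static Token *upsert(Map **m, Token namespace, Str segment)
{
    for (uint64_t h = hash(namespace, segment); *m; h <<= 2) {
        Map *n = *m;
        if (n->namespace == namespace && n->segment.len == segment.len &&
            (!segment.len || !memcmp(n->segment.data, segment.data, segment.len))) {
            return &n->token;
        }
        m = &n->child[h >> 62];
    }
    Map *n = calloc(1, sizeof *n);  // assumption: the original uses an arena
    assert(n);
    n->namespace = namespace;
    n->segment   = segment;
    *m = n;
    return &n->token;
}

// Hypothetical helper: assign the next token on first sight. Token 0 marks
// a fresh node because 0 is reserved for the root namespace.
static Token intern(Map **m, Token *next, Token namespace, Str segment)
{
    Token *t = upsert(m, namespace, segment);
    if (!*t) {
        *t = ++*next;
    }
    return *t;
}
```

Returning a pointer to the token slot lets the caller distinguish a new entry from an existing one without a second lookup. Nodes are never freed in this sketch; with an arena, as in the original, they would be released all at once.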
Field parsing builds token sequences before sorting:
Slice(Field) parse_fields(Str fieldlist, Arena *a) {
    Slice(Field) fields = {0};  // accumulated result
    Map *strtab = 0;            // interning table, assumed local and initially empty
    for (Cut c = cut(fieldlist, ','); c.ok; c = cut(c.tail, ',')) {
        Field field = {.name = c.head, .index = fields.len};
        Token prev = 0;  // Root namespace
        for (Cut f = cut(field.name, '.'); f.ok; f = cut(f.tail, '.')) {
            Token *t = upsert(&strtab, prev, f.head, a);  // Look up or create the token
            *push(a, &field.tokens) = *t;                 // Append token
            prev = *t;                                    // Descend namespace
        }
        *push(a, &fields) = field;
    }
    field_sort(fields, *a);  // Lex sort by token sequences (arena passed by value)
    return fields;
}
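The Str and Cut helpers used by the parsing loops are not defined in the excerpt. The sketch below shows one way they could look so that both loops terminate as written. It assumes that ok means "a segment was produced" and that Str is a pointer-plus-length pair. Both are guesses rather than the original definitions, and count_segments is a hypothetical helper added only to exercise the loop shape.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *data;
    ptrdiff_t   len;
} Str;

typedef struct {
    Str  head;  // text before the first delimiter
    Str  tail;  // text after it (empty if no delimiter was found)
    bool ok;    // assumption: true when a segment was produced
} Cut;

// Split s at the first occurrence of delim. An empty input yields
// ok == false, which is what ends the for loops in parse_fields.
static Cut cut(Str s, char delim)
{
    Cut r = {0};
    assert(s.len >= 0);
    if (s.len == 0) {
        return r;
    }
    const char *p = memchr(s.data, delim, (size_t)s.len);
    ptrdiff_t   n = p ? p - s.data : s.len;
    r.head = (Str){s.data, n};
    r.tail = p ? (Str){p + 1, s.len - n - 1} : (Str){s.data + s.len, 0};
    r.ok   = true;
    return r;
}

// Count segments using the same loop shape as parse_fields.
static int count_segments(Str s, char delim)
{
    int n = 0;
    for (Cut c = cut(s, delim); c.ok; c = cut(c.tail, delim)) {
        n++;
    }
    return n;
}
```

Under these semantics a trailing delimiter does not produce a final empty segment, which seems acceptable for a field list but is a design choice rather than something the source specifies.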
Why This Matters
Beyond JSON serialization, this technique optimizes:
- Database Systems: Restructuring nested query results
- Network Protocols: Parsing packed binary formats with scattered substructures
- Memory Efficiency: Interning minimizes string duplication across millions of records
The author's prototype even compiles token sequences into a JSON-bytecode interpreter, hinting at further optimizations like instruction coalescing. As datasets grow more hierarchical, solutions blending tokenization with namespace-aware sorting will become indispensable for high-performance data transformation.
Source: Adapted from nullprogram.com