Efficiently Sorting Hierarchical Data with String Interning

Discover an elegant solution for transforming non-contiguous nested data structures into JSON-friendly formats while preserving field order. By leveraging token-based namespaces and string interning, this approach avoids quadratic complexity pitfalls and optimizes memory usage. Learn how hierarchical field sorting tackles real-world data serialization challenges with C implementation insights.

When serializing hierarchical data to JSON, developers often face a common challenge: input records may contain interleaved substructures that must become contiguous in the output. Traditional lexicographical sorting would solve grouping but disrupt the original field order. A recent solution by Chris Wellons introduces token-based namespaces and string interning to balance both requirements efficiently.

The Core Problem

Consider sensor data fields like timestamp, point.x, point.y, foo.bar.z, point.z, foo.bar.y, and foo.bar.x. Input records store values in this interleaved order, but JSON demands contiguous substructures:

{
  "timestamp": 1758158348,
  "point": {"x": 1.23, "y": 4.56, "z": 7.89},
  "foo": {"bar": {"z": -100, "y": -200, "x": -300}}
}

Simply sorting fields alphabetically would misplace timestamp and ignore structural relationships. The solution? Hierarchical tokenization with namespaces.

Token-Based Namespaces

String Interning: Assign unique monotonic tokens to each path segment (e.g., point → token 2).
Namespace Binding: Tokens inherit context from parent structures using a (parent_token, segment) → child_token map:
```
{0, "point"} → 2
{2, "x"}    → 3
{0, "foo"}   → 5
{5, "bar"}   → 6
```
Path Resolution: Convert foo.bar.x into token sequence [5,6,10]—preserving hierarchy without redundant string storage.

Algorithmic Advantages

O(n log n) Complexity: Token sequences sort lexically to group substructures:
```
point.x → [2,3]
point.y → [2,4]
point.z → [2,8]  // Now contiguous!
```
Quadratic Avoidance: Early approaches using full path prefixes ("foo.bar.x") risked O(n²) cost. Namespaces reduce segments to constant-time lookups.
Order Preservation: Top-level fields like timestamp retain position unless substructure grouping necessitates reordering.

C Implementation Highlights

A hash-trie maps (token, string) pairs to tokens, leveraging arena allocation for zero-cleanup memory management:

typedef struct Map Map;
struct Map {
  Map *child[4];
  Token namespace;
  Str segment;
  Token token;
};

Token *upsert(Map **m, Token namespace, Str segment, Arena *a) {
  // ... Hash trie insertion logic
}

Field parsing builds token sequences before sorting:

Slice(Field) parse_fields(Str fieldlist, Arena *a) {
  for (Cut c = cut(fieldlist, ','); c.ok; c = cut(c.tail, ',')) {
    Field field = {.name = c.head, .index = fields.len};
    Token prev = 0; // Root namespace
    for (Cut f = cut(field.name, '.'); f.ok; f = cut(f.tail, '.')) {
      Token *t = upsert(&strtab, prev, f.head, a);
      *push(a, &field.tokens) = *t; // Append token
      prev = *t; // Descend namespace
    }
    *push(a, &fields) = field;
  }
  field_sort(fields, *a); // Lex sort by token sequences
  return fields;
}

Why This Matters

Beyond JSON serialization, this technique optimizes:

Database Systems: Restructuring nested query results
Network Protocols: Parsing packed binary formats with scattered substructures
Memory Efficiency: Interning minimizes string duplication across millions of records

The author's prototype even compiles token sequences into a JSON-bytecode interpreter, hinting at further optimizations like instruction coalescing. As datasets grow more hierarchical, solutions blending tokenization with namespace-aware sorting will become indispensable for high-performance data transformation.

Source: Adapted from nullprogram.com