A Finnish‑English dictionary app that once required a 3 GB SQLite download was rebuilt in Rust using the fst crate, shrinking the data store to 10 MB—a 300× reduction—by exploiting the compressive power of minimal acyclic deterministic finite‑state automata for prefix and suffix sharing.
When I first tackled Taskusanakirja (tsk), a Finnish‑English pocket dictionary with instant search‑as‑you‑type, the most natural data structure seemed to be a trie. A trie gives O(k) lookup where k is the length of the query, and with a few simple tricks—capping the number of results, memoising all 1‑, 2‑, and 3‑character prefixes—I could squeeze the whole thing into roughly 60 MB. That worked fine for a modest word list, but Finnish is a highly agglutinative language: a single lemma can spawn dozens of inflected forms, and the total number of distinct surface forms quickly balloons into the tens of millions.
The trie hit a wall
A naïve trie stores every distinct suffix path separately. Even with prefix sharing, the sheer volume of suffixes caused the structure to bloat beyond the 50 MB RAM budget of the low‑end laptops I was targeting. My stop‑gap solution was to off‑load the inflections to a SQLite database with the Full‑Text Search (FTS) extension. The database answered queries instantly, but the download size exploded to 3 GB—far from pocket‑friendly.
A rust‑y revelation
While scrolling through Rust community posts, I stumbled on Andrew Gallant’s (BurntSushi) essay “Index 1,600,000,000 Keys with Automata and Rust”. The piece demonstrated that finite‑state transducers (FSTs), built with the fst crate, can store ordered sets of strings in a fraction of the space a trie needs, while still supporting fast prefix, suffix, and fuzzy searches.
The key insight is that an acyclic deterministic finite‑state automaton (DAFSA) merges identical sub‑trees—not just common prefixes but also common suffixes. In a language like Finnish, where a handful of suffixes recur thousands of times, this suffix sharing yields dramatic compression.
The migration process
- Extract the data – I wrote a tiny Rust program that streamed every entry from the SQLite dump (lemma, inflection, definition) and fed it into the
fst::MapBuilder. - Encode the mapping – Each surface form became a key; the value was a compact offset into a separate, gzipped JSON blob that held the full definition. Offsets are 32‑bit integers, so the map stays tiny.
- Finalize the automaton – The builder writes a binary representation that is essentially a read‑only, memory‑mapped structure. No runtime allocation is required beyond loading the file.
- Replace the lookup layer – The old SQLite query was swapped for a call to
fst::Map::getfollowed by a slice into the definition blob.
The result? A 10 MB binary that fits comfortably on any laptop, a 300× reduction compared to the original SQLite file, and lookup latency indistinguishable from the previous implementation.
Why FSTs beat tries for Finnish
| Aspect | Trie | Minimal DAFSA (FST) |
|---|---|---|
| Prefix sharing | ✔︎ | ✔︎ |
| Suffix sharing | ✘ | ✔︎ |
| Static data size | Linear in number of distinct suffixes | Sub‑linear – identical suffix sub‑trees collapse |
| Runtime memory overhead | Requires node allocations or a large in‑memory array | Memory‑maps a read‑only byte slice |
In practice, the Finnish dictionary contains roughly 40 million surface forms, but only a few hundred distinct suffix patterns. The DAFSA merges all those repetitive tails, turning what would be gigabytes of node objects into a compact adjacency table.
Broader implications
- Portability – The final binary is a single executable that runs on Windows, macOS, and Linux without external dependencies. No need to ship a separate database engine.
- Energy efficiency – Memory‑mapped reads avoid costly heap allocations, which translates to lower CPU usage and longer battery life on the target low‑end hardware.
- Maintainability – The data generation step is a one‑off script; updating the dictionary simply means re‑running the Rust builder and bundling the new
.fstfile. - General applicability – Any application that needs fast prefix or suffix lookup over a static corpus—spell‑checkers, code completion engines, linguistic tools—can benefit from the same pattern.
Counter‑perspectives
Some developers argue that introducing Rust and the fst crate adds unnecessary complexity to a project that originally lived comfortably in Go. The learning curve for Rust’s ownership model can be steep, and the ecosystem around FSTs is still niche compared to the mature SQLite tooling. Moreover, FSTs excel when the dataset is static; if the dictionary needed frequent runtime updates, a mutable structure like a B‑tree would still be preferable.
Nevertheless, the trade‑off in this case is clear: the dictionary’s data never changes at runtime, and the memory budget is the dominant constraint. By embracing a data‑structure that mirrors the linguistic regularities of Finnish, the project achieved a level of efficiency that would have been impossible with a naïve trie or a heavyweight relational engine.
Closing thoughts
The episode illustrates a broader lesson for engineers: when a problem sits at the intersection of speed, portability, and poor memory ergonomics, it is worth stepping back and asking whether a different algorithmic abstraction—here, a minimal DAFSA—might be a better fit than the familiar tool. The 300× space reduction was not a miracle of hardware; it was the result of matching the data’s intrinsic redundancy with a structure designed to exploit that redundancy.
If you are curious to experiment yourself, the full source for the conversion script lives in the author’s GitHub repository: github.com/andrew-quinn/taskusanakirja-fst. The fst crate documentation provides a clear walkthrough of building and querying automata, and the compiled binary can be dropped into any existing deployment pipeline.
In the end, a pocket dictionary should indeed fit in a pocket—both metaphorically and literally.
Comments
Please log in or register to join the discussion