The Invisible Engine Powering Your Queries

Full-text search (FTS) silently powers countless daily interactions—from e-commerce product discoveries to troubleshooting programming errors. Yet most developers rarely consider the intricate machinery behind these instant results. When Artem Krylysov set out to build an FTS engine capable of searching 600K Wikipedia abstracts in under a millisecond, he confronted fundamental questions about text processing, indexing, and retrieval efficiency.

Why Not Grep? The Naive Approach

Krylysov's journey began with a brute-force implementation in Go:

func search(docs []document, term string) []document {
    var results []document
    for _, doc := range docs {
        if strings.Contains(doc.Text, term) {
            results = append(results, doc)
        }
    }
    return results
}

While functional for small datasets (103ms for 600K documents), this approach revealed critical flaws:
1. Case sensitivity missed "Cat" when searching for "cat"
2. Substring matches returned false positives ("caterpillar")

Switching to regex ((?i)\bcat\b) solved matching precision but cratered performance—queries now took over 2 seconds. At scale, this linear O(n) complexity becomes prohibitive.

The Inverted Index Revolution

The breakthrough came with inverted indexes—a data structure mapping tokens to document IDs. Krylysov illustrates this with a book index analogy:

Article illustration 1

Inverted indexes transform document-centric data into term-centric maps, enabling instant lookups.

Building the index required sophisticated text analysis:

func analyze(text string) []string {
    tokens := tokenize(text) // Split on word boundaries
    tokens = lowercaseFilter(tokens)
    tokens = stopwordFilter(tokens) // Remove "a", "the", etc.
    tokens = stemmerFilter(tokens) // Reduce "donuts" → "donut"
    return tokens
}

This pipeline converted "A donut on a glass plate" into ["donut", "glass", "plate"]—normalized tokens ready for indexing.

Querying at Lightning Speed

The index structure enabled sub-millisecond searches:

func (idx index) search(text string) []int {
    tokens := analyze(text)
    var results []int
    for _, token := range tokens {
        if ids, exists := idx[token]; exists {
            results = intersect(results, ids) // Boolean AND
        }
    }
    return results
}

Critical to performance was the intersect function—a linear merge algorithm for sorted ID lists. For the query "small wild cat", it identified just two relevant documents among 600K in 18 microseconds.

Article illustration 4

Asian Golden Cat—one of two species found via "small wild cat" query (Image: Karen Stout, CC BY-SA 2.0)

Beyond Basic Search: Next Steps

While Krylysov's engine handles exact phrase searches, he outlines compelling enhancements:
- Boolean operators: Implement OR/NOT logic
- Disk-backed storage: Avoid rebuilding indexes on restart
- Relevance ranking: Sort results by term proximity/frequency
- Roaring Bitmaps: Optimize ID set storage

The complete source code demonstrates how minimal Go can achieve remarkable efficiency—processing Wikipedia's 913MB abstract dump with under 200 lines of code.

As search interfaces evolve, understanding these foundational algorithms remains crucial. Whether implementing site search or optimizing database queries, the inverted index proves that sometimes the most powerful solutions emerge from intelligently restructuring the problem.

Source: Artem Krylysov