Building a Blazing-Fast Full-Text Search Engine in Go
The Invisible Engine Powering Your Queries
Full-text search (FTS) silently powers countless daily interactions—from e-commerce product discoveries to troubleshooting programming errors. Yet most developers rarely consider the intricate machinery behind these instant results. When Artem Krylysov set out to build an FTS engine capable of searching 600K Wikipedia abstracts in under a millisecond, he confronted fundamental questions about text processing, indexing, and retrieval efficiency.
Why Not Grep? The Naive Approach
Krylysov's journey began with a brute-force implementation in Go:
func search(docs []document, term string) []document {
    var results []document
    for _, doc := range docs {
        if strings.Contains(doc.Text, term) {
            results = append(results, doc)
        }
    }
    return results
}
While workable for a small corpus, the full scan took 103ms across the 600K documents and exposed two critical flaws:
1. Case sensitivity: searching for "cat" missed documents containing "Cat"
2. Substring matching: "cat" returned false positives such as "caterpillar"
Switching to a regular expression ((?i)\bcat\b) fixed both problems but cratered performance: queries now took over two seconds. Either way, scanning every document is O(n) in the size of the corpus, which becomes prohibitive at scale.
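A minimal sketch of that regex variant, assuming the same docs slice and document type as above and the standard regexp package; the function name searchRegex is illustrative, not the author's:

func searchRegex(docs []document, term string) []document {
    // (?i) makes the match case-insensitive, and \b anchors on word
    // boundaries so "cat" no longer matches "caterpillar". QuoteMeta
    // escapes any regex metacharacters in the user's term.
    re := regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(term) + `\b`)
    var results []document
    for _, doc := range docs {
        if re.MatchString(doc.Text) {
            results = append(results, doc)
        }
    }
    return results
}

Even with the pattern compiled once per query, every document still has to be scanned, which is where the multi-second latency comes from.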
The Inverted Index Revolution
The breakthrough came with the inverted index, a data structure mapping each token to the IDs of the documents that contain it. Krylysov illustrates this with a book index analogy: rather than scanning every page for a term, you look the term up once and jump straight to the pages that mention it.
Inverted indexes transform document-centric data into term-centric maps, enabling instant lookups.
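As a rough sketch, assuming each document carries an integer ID field (not shown in the snippets here), the index can be an ordinary Go map from token to a sorted slice of document IDs, filled in by the analyze pipeline described next:

// index maps each token to the IDs of the documents containing it.
type index map[string][]int

// add analyzes each document and appends its ID to every token's posting list.
// Indexing documents in ID order keeps each list sorted.
func (idx index) add(docs []document) {
    for _, doc := range docs {
        for _, token := range analyze(doc.Text) {
            ids := idx[token]
            // Skip duplicate IDs when a token occurs twice in one document.
            if len(ids) > 0 && ids[len(ids)-1] == doc.ID {
                continue
            }
            idx[token] = append(ids, doc.ID)
        }
    }
}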
Building the index required sophisticated text analysis:
func analyze(text string) []string {
    tokens := tokenize(text)         // Split on word boundaries
    tokens = lowercaseFilter(tokens)
    tokens = stopwordFilter(tokens)  // Remove "a", "the", etc.
    tokens = stemmerFilter(tokens)   // Reduce "donuts" → "donut"
    return tokens
}
This pipeline converted "A donut on a glass plate" into ["donut", "glass", "plate"]—normalized tokens ready for indexing.
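The individual filters can stay small. Here is one plausible shape for the first three, assuming the standard strings and unicode packages; stemmerFilter would typically wrap an existing Snowball stemmer library rather than be hand-rolled:

// tokenize splits text on any character that is not a letter or a digit.
func tokenize(text string) []string {
    return strings.FieldsFunc(text, func(r rune) bool {
        return !unicode.IsLetter(r) && !unicode.IsNumber(r)
    })
}

// lowercaseFilter normalizes every token to lower case.
func lowercaseFilter(tokens []string) []string {
    r := make([]string, len(tokens))
    for i, token := range tokens {
        r[i] = strings.ToLower(token)
    }
    return r
}

// stopwordFilter drops very common words that carry little meaning.
func stopwordFilter(tokens []string) []string {
    stopwords := map[string]struct{}{
        "a": {}, "and": {}, "be": {}, "have": {}, "i": {},
        "in": {}, "of": {}, "that": {}, "the": {}, "to": {},
    }
    r := make([]string, 0, len(tokens))
    for _, token := range tokens {
        if _, ok := stopwords[token]; !ok {
            r = append(r, token)
        }
    }
    return r
}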
Querying at Lightning Speed
The index structure enabled sub-millisecond searches:
func (idx index) search(text string) []int {
    var results []int
    for _, token := range analyze(text) {
        ids, exists := idx[token]
        if !exists {
            // A token with no postings means no document matches every term.
            return nil
        }
        if results == nil {
            results = ids
        } else {
            results = intersect(results, ids) // Boolean AND
        }
    }
    return results
}
Critical to performance was the intersect function—a linear merge algorithm for sorted ID lists. For the query "small wild cat", it identified just two relevant documents among 600K in 18 microseconds.
Asian Golden Cat—one of two species found via "small wild cat" query (Image: Karen Stout, CC BY-SA 2.0)
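The merge works only because every posting list is kept sorted by document ID; a sketch of how intersect might advance two cursors in lockstep:

// intersect returns the IDs present in both sorted slices by advancing
// whichever cursor currently points at the smaller value.
func intersect(a, b []int) []int {
    maxLen := len(a)
    if len(b) < maxLen {
        maxLen = len(b)
    }
    r := make([]int, 0, maxLen)
    var i, j int
    for i < len(a) && j < len(b) {
        switch {
        case a[i] < b[j]:
            i++
        case a[i] > b[j]:
            j++
        default:
            r = append(r, a[i])
            i++
            j++
        }
    }
    return r
}

Each comparison discards at least one candidate, so the cost grows with the combined length of the two lists rather than their product.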
Beyond Basic Search: Next Steps
While Krylysov's engine handles boolean AND queries over analyzed tokens, he outlines compelling enhancements:
- Boolean operators: Implement OR/NOT logic (see the union sketch after this list)
- Disk-backed storage: Avoid rebuilding indexes on restart
- Relevance ranking: Sort results by term proximity/frequency
- Roaring Bitmaps: Optimize ID set storage
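OR support, for instance, is the same two-cursor walk with a union instead of an intersection; a sketch under the same sorted-posting-list assumption:

// union merges two sorted ID slices, keeping each ID once.
func union(a, b []int) []int {
    r := make([]int, 0, len(a)+len(b))
    var i, j int
    for i < len(a) && j < len(b) {
        switch {
        case a[i] < b[j]:
            r = append(r, a[i])
            i++
        case a[i] > b[j]:
            r = append(r, b[j])
            j++
        default:
            r = append(r, a[i])
            i++
            j++
        }
    }
    r = append(r, a[i:]...)
    r = append(r, b[j:]...)
    return r
}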
The complete source code demonstrates how minimal Go can achieve remarkable efficiency—processing Wikipedia's 913MB abstract dump with under 200 lines of code.
As search interfaces evolve, understanding these foundational algorithms remains crucial. Whether implementing site search or optimizing database queries, the inverted index proves that sometimes the most powerful solutions emerge from intelligently restructuring the problem.
Source: Artem Krylysov