Introduction

When a user already has a document and wants related items, the More Like This feature provides the answer.

Classic lexical approach

The early method used term overlap to find similar documents. The system extracted the most distinctive words from the source text, built a query from them, and searched for documents containing those terms. Parameters such as min_term_freq, min_doc_freq, max_doc_freq, and max_query_terms controlled query generation. This technique worked well for exact identifiers such as error codes, product SKUs, part numbers, function names, stack traces, legal wording, and near-duplicate descriptions. Because the inverted index and analyzers were already in place, the operation was inexpensive to run. However, the method struggled when documents described the same concept in different words. Synonyms, paraphrases, and cross-language matches were often missed.
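The term selection step can be sketched in a few lines. This is a minimal illustration, not any particular search engine's implementation: the function name select_query_terms, the TF-IDF weighting formula, and the toy document frequencies are assumptions made for this example. Only the four parameter names come from the text above.

```python
import math
from collections import Counter


def select_query_terms(source_tokens, doc_freq, num_docs,
                       min_term_freq=2, min_doc_freq=1,
                       max_doc_freq=None, max_query_terms=25):
    """Pick the most distinctive terms of a source document by TF-IDF.

    source_tokens: analyzed tokens of the source document.
    doc_freq: hypothetical mapping of term -> number of documents containing it.
    num_docs: total number of documents in the collection.
    """
    term_freq = Counter(source_tokens)
    scored = []
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        if tf < min_term_freq:
            continue  # too rare in the source document to be characteristic
        if df < min_doc_freq:
            continue  # too rare in the collection to match anything useful
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # too common in the collection (stop-word-like)
        idf = math.log(1 + num_docs / (1 + df))
        scored.append((tf * idf, term))
    # Highest score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]


# Toy data invented for the example.
tokens = "heap heap leak leak leak the the the the".split()
frequencies = {"heap": 3, "leak": 2, "the": 1000}
query_terms = select_query_terms(tokens, frequencies, num_docs=1000,
                                 max_doc_freq=500, max_query_terms=2)
print(query_terms)
```

The selected terms would then be sent to the inverted index as an ordinary OR query, which is why the approach costs little beyond a normal search.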

When embeddings change the game

The rise of embedding models introduced numerical vectors that capture semantic meaning. A document is stored as a dense vector, and similarity is measured by vector proximity. This shift lets the system find documents that convey the same idea even if the wording differs. For example, a description of a memory leak may match a report about unbounded heap growth because the vectors reflect the underlying concept.
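To make vector proximity concrete, here is a small sketch that ranks documents by cosine similarity to a source document. The three-dimensional vectors and document names are invented for illustration. Real embeddings have hundreds or thousands of dimensions and come from a trained model, and cosine similarity is only one of several proximity measures in use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def more_like_this(source_id, vectors, k=3):
    """Return the k document ids whose vectors are closest to the source."""
    source = vectors[source_id]
    scored = [
        (cosine_similarity(source, vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != source_id  # never recommend the source itself
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy embeddings, hand-written for the example.
vectors = {
    "memory-leak": [0.9, 0.1, 0.0],
    "heap-growth": [0.8, 0.2, 0.1],
    "login-bug": [0.0, 0.1, 0.9],
}
print(more_like_this("memory-leak", vectors, k=1))
```

This brute-force scan compares the source against every stored vector. It is fine for small collections but does not scale to large ones.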

Hybrid search in practice

Production systems combine lexical and vector methods. Exact matches for identifiers are handled by full-text search, while semantic similarity is added through vector search. Filters such as access control, tenant isolation, and date ranges are applied before ranking. A reranking step can then refine the order using a more precise model. The result is a balanced set of results that respects both exact-match and meaning-based relevance.
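One way to picture the pipeline is the sketch below. It assumes the lexical and vector searches have already returned ranked lists of document ids, and it merges them with reciprocal rank fusion, which is one common fusion method and not necessarily the one a given system uses. The function names, the constant k=60, and the document ids are all assumptions for illustration. Real engines usually push filters into the query itself rather than filtering candidate lists afterwards.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked id lists; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def hybrid_more_like_this(lexical_hits, vector_hits, allowed_ids, top_n=5):
    """Drop documents the user may not see, then fuse the two rankings."""
    lexical = [doc_id for doc_id in lexical_hits if doc_id in allowed_ids]
    vector = [doc_id for doc_id in vector_hits if doc_id in allowed_ids]
    return reciprocal_rank_fusion([lexical, vector])[:top_n]


# Hypothetical result lists, best match first.
lexical_hits = ["doc-a", "doc-b", "doc-secret"]
vector_hits = ["doc-b", "doc-c", "doc-secret", "doc-a"]
allowed_ids = {"doc-a", "doc-b", "doc-c"}  # e.g. the caller's tenant

results = hybrid_more_like_this(lexical_hits, vector_hits, allowed_ids)
print(results)
```

Rank-based fusion sidesteps the problem that lexical scores and vector similarities live on different scales. A weighted sum of normalized scores is an alternative when finer control is needed.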

Timeline of development

The method evolved through several phases. In the 2000s, lexical analysis dominated, with TF-IDF or BM25 scoring term overlap. In the 2010s, word2vec and GloVe became popular, and their word-level embeddings were extended to whole documents. From the late 2010s into the 2020s, FAISS and other approximate nearest neighbor libraries made vector search feasible on massive collections. By the mid-2020s, product features such as retrieval-augmented generation and recommendation systems had made lookup by stored vectors a common pattern. The overall trend is a move from pure term matching toward matching document vector representations, while still preserving lexical precision where needed.

What to keep in mind

Semantic similarity does not replace all search engineering. Exact matches for identifiers, error codes, and other strict formats still require lexical search. Embedding model versions must be tracked in metadata, because vectors produced by different models are not comparable and mixing them breaks the consistency of the vector space. Access control rules, tenant filters, and other security constraints must be applied during the search. Hybrid search pipelines need careful tuning of the weighting between lexical and vector scores. Reranking can improve the final ordering but adds complexity. Monitoring should track precision, recall, false positive rate, and the recall lost to approximate nearest neighbor search.
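Two of these points lend themselves to a short sketch: guarding against mixed embedding versions, and measuring how much an approximate index loses compared with an exact search. The StoredVector record, its field names, and the version strings are hypothetical. The recall measure shown is the usual overlap between the approximate top k and the exact top k.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredVector:
    doc_id: str
    model_version: str  # hypothetical metadata field naming the embedding model
    values: tuple


def comparable(source, candidates):
    """Keep only candidates embedded with the same model version as the source."""
    return [c for c in candidates if c.model_version == source.model_version]


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact_top & set(approx_ids[:k])) / len(exact_top)


# Toy records invented for the example.
source = StoredVector("doc-1", "model-v2", (0.1, 0.9))
candidates = [
    StoredVector("doc-2", "model-v2", (0.2, 0.8)),
    StoredVector("doc-3", "model-v1", (0.1, 0.9)),  # stale embedding, skipped
]
usable = [c.doc_id for c in comparable(source, candidates)]
recall = recall_at_k(["a", "b", "x"], ["a", "b", "c"], k=3)
print(usable, round(recall, 2))
```

Computing the exact top k requires a brute-force scan, so in practice this check would run on a sample of queries rather than on live traffic.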

Conclusion

More Like This has shifted from a purely lexical process to a hybrid approach that blends term-based matching with vector-based similarity. The core idea remains unchanged: a user selects a source document and the system returns materials that are relevant to it, taking both exact and semantic cues into account. As the technology matures, the balance between precision and coverage will continue to evolve, guided by real-world usage and ongoing research.

#Semantic Search #hybrid search #Embeddings #Vector Search #similarity features

The Evolution of 'More Like This' Content