Dropbox engineers have developed an innovative approach to improve the quality of relevance labeling for retrieval-augmented generation (RAG) systems by combining human expertise with large language model automation. This hybrid method addresses a critical bottleneck in RAG systems where document retrieval quality directly impacts the relevance of generated responses.

The Challenge of Document Retrieval in RAG Systems
In RAG systems like Dropbox Dash, the quality of search ranking determines which documents are passed to the LLM for generating responses. As Dropbox principal engineer Dmitriy Meyerzon explains, with millions or even billions of documents in enterprise search indexes, only a small subset can be sent to the LLM. This makes the quality of search ranking—and the labeled relevance data used to train it—critical to the final answer quality.
The traditional approach relies on human judges to label query-document pairs according to how well each document satisfies a given query. However, this purely human-based labeling is expensive, slow, and inconsistent, creating a significant bottleneck in scaling RAG systems.
Human-Calibrated LLM Labeling: A Hybrid Solution
To address these limitations, Dropbox introduced a complementary approach where LLMs generate relevance judgments at scale. This method is cheaper, more consistent, and can easily scale to large document sets. However, LLMs are not perfect evaluators, so their judgments must be assessed before being used for training.
The solution, called "human-calibrated LLM labeling," follows a straightforward process:
- Humans label a small, high-quality dataset that serves as ground truth
- This dataset is used to calibrate the LLM evaluator
- The calibrated LLM then generates hundreds of thousands or even millions of labels
- This amplifies human effort by roughly 100× while maintaining quality
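The calibration loop above can be sketched as follows. This is an illustrative sketch, not Dropbox's implementation: `calibrate_llm_labeler`, the ratings scale, and the agreement threshold are all assumptions made for the example.

```python
def calibrate_llm_labeler(human_labels, llm_label_fn, threshold=0.85):
    """Compare LLM relevance labels against a small human-labeled
    ground-truth set and return the agreement rate, so the evaluator
    prompt can be iterated until it clears a target threshold.

    human_labels: {(query, doc): rating} produced by human judges.
    llm_label_fn: callable (query, doc) -> rating from the LLM judge.
    Illustrative only -- not Dropbox's actual code or thresholds.
    """
    agree = 0
    for (query, doc), human_rating in human_labels.items():
        if llm_label_fn(query, doc) == human_rating:
            agree += 1
    rate = agree / len(human_labels)
    # Only once agreement is high enough would the LLM be trusted
    # to generate labels at scale for the full document set.
    return rate, rate >= threshold
```

In this framing, the expensive human labels are spent once on calibration, and the cheap LLM labels do the scaling, which is where the roughly 100× amplification comes from.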
The key insight is that LLMs don't replace the ranking system entirely. Using them directly for query-time ranking would be too slow and limited by context window constraints. Instead, they serve as a powerful tool for generating training data.
Evaluation and Quality Assurance
The evaluation step involves comparing LLM-generated relevance ratings with human judgments on a test subset of query-document pairs not included in the training set. This comparison focuses particularly on the hardest mistakes—cases where LLM judgments disagree with user behavior.
For example, when users click documents the LLM rated low or skip documents the LLM rated high, these discrepancies provide the strongest learning signal. By analyzing these edge cases, the system can be fine-tuned to better align with actual user needs and behaviors.
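Mining those discrepancies can be expressed as a simple filter over judged pairs. The tuple layout, the 0-3 rating scale, and the function name below are assumptions for illustration, not details from the talk.

```python
def hard_disagreements(judgments):
    """From (query, doc, llm_rating, clicked) tuples, pull the cases
    with the strongest learning signal: documents users clicked but
    the LLM rated low, and documents users skipped but the LLM rated
    high. Ratings assumed on a 0-3 scale; names are illustrative.
    """
    clicked_but_low = [j for j in judgments if j[3] and j[2] <= 1]
    skipped_but_high = [j for j in judgments if not j[3] and j[2] >= 2]
    return clicked_but_low, skipped_but_high
```

These two buckets are exactly the edge cases the article describes: each one is a place where the calibrated evaluator and real user behavior diverge, and so a candidate for prompt or model refinement.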
Context Matters: The "Diet Sprite" Problem
One important consideration in relevance labeling is that context is often critical for judging relevance. As Meyerzon notes, a query like "diet sprite" at Dropbox refers to an internal performance tool rather than a beverage. Without this context, an LLM might make incorrect relevance judgments.
To address this, Dropbox's approach allows LLMs to run additional searches, look up context, and understand internal terminology. This capability dramatically improves labeling accuracy by ensuring the LLM has the necessary background information to make informed judgments.
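One way to picture this is a labeling prompt that is assembled only after an extra context-lookup step. The sketch below is a minimal illustration under assumed names: `context_fn` stands in for whatever additional search or glossary lookup the evaluator runs, and the prompt wording is invented.

```python
def build_labeling_prompt(query, document, context_fn):
    """Assemble a relevance-labeling prompt that first looks up
    internal context for the query, so ambiguous terms (like
    'diet sprite' meaning an internal performance tool) are judged
    correctly. context_fn stands in for an extra search step;
    all names here are illustrative, not Dropbox's API.
    """
    context = context_fn(query)
    return (
        f"Internal context: {context}\n"
        f"Query: {query}\n"
        f"Document: {document}\n"
        "Rate the document's relevance to the query from 0 to 3, "
        "using the internal context above."
    )
```

Without the context line, an evaluator would plausibly rate beverage-related documents highly for "diet sprite"; with it, the internal meaning dominates the judgment.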
Impact and Scalability
This human-calibrated approach lets Dropbox consistently amplify human judgment at scale, improving RAG systems without the prohibitive cost and inconsistency of purely human labeling.
For organizations building RAG systems, this approach offers a practical path forward. It acknowledges that while LLMs are powerful tools for scaling, human judgment remains essential for quality control and calibration. The hybrid model leverages the strengths of both approaches: human expertise for accuracy and LLMs for scale.
Broader Implications for AI Systems
This approach has implications beyond just RAG systems. It represents a broader pattern in AI development where human-AI collaboration produces better results than either humans or AI working alone. By thoughtfully combining human judgment with machine learning capabilities, organizations can build more accurate, scalable, and cost-effective AI systems.
The Dropbox team's work demonstrates that the future of AI development isn't about replacing human judgment but rather about creating systems that amplify and extend human capabilities. This human-calibrated approach could be applied to many other areas where large-scale labeling or evaluation is needed, from content moderation to quality assurance in various domains.
For developers and organizations working with RAG systems or similar AI applications, the key takeaway is that thoughtful integration of human and machine intelligence can solve scalability challenges while maintaining or even improving quality. The success of this approach suggests that hybrid human-AI systems will likely become increasingly important as AI applications continue to scale.
