Foundation Models for Ranking: Challenges, Successes, and Lessons Learned

Rust Reporter

Netflix's journey from multi-model ranking systems to unified contextual recommenders and foundation models, exploring how they built task-agnostic user understanding while solving system challenges like high-throughput inference and the personalization-relevance tradeoff.

Moumita Bhattacharya, machine learning manager at Netflix, presents a comprehensive look at the evolution of Netflix's ranking systems, from traditional multi-model architectures to their innovative Unified Contextual Recommender (UniCoRn) and foundation models. The talk, delivered at QCon London, explores how Netflix tackled the challenge of serving personalized content to hundreds of millions of users while managing system complexity and performance requirements.

The Evolution of Ranking Systems

Bhattacharya begins by establishing the context of search and recommendation as omnipresent applications of machine learning across major platforms like Netflix, Spotify, Amazon, and others. With user bases exceeding 100 million and catalogs containing hundreds of millions of items, the challenge becomes how to efficiently rank and present relevant content without overwhelming users or systems.

The traditional approach involves a two-stage ranking system. First, a lightweight candidate selection or retrieval stage reduces the catalog from millions of items to hundreds of thousands. Then, a more complex second-pass ranker applies sophisticated models to deliver personalized, relevant results. This talk focuses primarily on optimizing the second-pass ranking stage.
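The two-stage pattern can be sketched as follows. The catalog, context, and scoring functions here are toy stand-ins chosen for illustration, not Netflix's actual models:

```python
# Sketch of a two-stage ranking system: a cheap retrieval pass narrows the
# catalog, then an expensive ranking pass scores only the survivors.
# All scoring logic below is a toy stand-in for illustration.

def retrieve_candidates(catalog, ctx, k=1000):
    """First pass: cheap per-item score, keep the top-k candidates."""
    scored = ((item, cheap_score(item, ctx)) for item in catalog)
    return [item for item, _ in sorted(scored, key=lambda p: p[1], reverse=True)[:k]]

def rank(candidates, ctx, n=20):
    """Second pass: a more expensive model scores only the candidates."""
    scored = ((item, expensive_score(item, ctx)) for item in candidates)
    return [item for item, _ in sorted(scored, key=lambda p: p[1], reverse=True)[:n]]

def cheap_score(item, ctx):
    # Context-free popularity: cheap enough to run over the whole catalog.
    return item["popularity"]

def expensive_score(item, ctx):
    # "Personalized" score: boosts genres the user likes.
    return item["popularity"] * (2.0 if item["genre"] in ctx["liked_genres"] else 1.0)

catalog = [
    {"id": i, "popularity": i % 7, "genre": "drama" if i % 2 else "comedy"}
    for i in range(100)
]
ctx = {"liked_genres": {"drama"}}
results = rank(retrieve_candidates(catalog, ctx, k=30), ctx, n=5)
```

The key property of the pattern is that the expensive model never sees the full catalog, only the retrieval output.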

Unified Contextual Recommender (UniCoRn)

A key innovation presented is UniCoRn, a model that unifies search and recommendation tasks that have historically been treated separately in academic and industry contexts. Bhattacharya explains that while search conferences like SIGIR and recommendation conferences like RecSys have traditionally operated in isolation, Netflix recognized that both tasks fundamentally involve ranking items based on context.

UniCoRn addresses several key differences between search and recommendation:

  • Context: Search has explicit user intent through queries, while recommendation may have no context beyond user ID
  • Engagement patterns: Users interact differently with search versus recommendation interfaces
  • Candidate sets: Different business rules and candidate selection processes apply

To unify these tasks, UniCoRn incorporates:

  • Unified context features including query, country, language, task type, and entity-specific features
  • Data-driven multitask learning that combines engagement data across all product surfaces
  • Context-specific features that allow the model to learn task-specific behaviors while benefiting from shared learning

The model architecture is a fully connected deep neural network with skip connections, optimizing for likelihood of positive engagement (typically play events). This unified approach replaced four separate models in production, reducing maintenance costs and enabling innovation to propagate across multiple use cases simultaneously.
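One way to picture the unified context is a single feature schema that both tasks fill in, with a null token wherever a field (such as the query on a recommendation surface) does not apply. The field names below are illustrative assumptions, not Netflix's actual schema:

```python
# Hypothetical sketch of a unified context: search and recommendation
# requests map into one feature dictionary so a single model can serve both.
# Field names and the null-token convention are assumptions for illustration.

NULL_TOKEN = "<none>"

def unified_context(task_type, user_id, country, language, query=None, entity_id=None):
    return {
        "task_type": task_type,        # lets the model learn task-specific behavior
        "user_id": user_id,
        "country": country,
        "language": language,
        "query": query or NULL_TOKEN,  # explicit intent exists only for search
        "entity_id": entity_id or NULL_TOKEN,
    }

search_ctx = unified_context("search", user_id=42, country="GB", language="en", query="heist")
rec_ctx = unified_context("recommendation", user_id=42, country="GB", language="en")
```

Because both request types share one schema, a single model (and one training pipeline) can consume either, while the `task_type` feature preserves task-specific behavior.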

System Considerations and Infrastructure

The transition to UniCoRn required significant infrastructure changes. Previously, Netflix maintained separate pipelines for different ranking tasks, each requiring independent label preparation, feature generation, model training, and serving infrastructure. This proliferation created maintenance overhead and technical debt.

UniCoRn consolidated these pipelines by standardizing the core components while maintaining task-specific label preparation. The unified approach simplified both offline training pipelines and online serving infrastructure. However, online serving presented new challenges:

  • Different product surfaces have varying SLA requirements
  • Some contexts require caching while others don't
  • Latency sensitivity varies across use cases
  • Throughput and compute cost optimization becomes critical at scale

To address these challenges, Netflix implemented flexible serving infrastructure with knobs to tune characteristics like caching, latency, and data freshness for different use cases while maintaining a unified model core.
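A minimal sketch of what such per-surface "knobs" around a shared model core could look like; the surfaces, field names, and values are hypothetical:

```python
# Illustrative per-surface serving configuration around a unified model:
# each product surface tunes caching, latency, and throughput independently
# while the scoring model stays shared. All values are made-up examples.

from dataclasses import dataclass

@dataclass(frozen=True)
class ServingConfig:
    surface: str
    latency_budget_ms: int  # SLA for this surface
    cache_ttl_s: int        # 0 means results are never cached
    max_batch_size: int     # throughput / compute-cost knob

SURFACE_CONFIGS = {
    "search":   ServingConfig("search", latency_budget_ms=100, cache_ttl_s=0, max_batch_size=64),
    "homepage": ServingConfig("homepage", latency_budget_ms=250, cache_ttl_s=300, max_batch_size=512),
}

def config_for(surface):
    return SURFACE_CONFIGS[surface]
```

The point of the design is that adding a new surface means adding a configuration entry, not a new model and pipeline.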

Foundation Models: The Next Evolution

The talk then explores Netflix's development of foundation models, inspired by the success of large language models like GPT and Llama. The key insight is that foundation models can holistically learn member preferences across long-term and short-term contexts while being task-agnostic.

Bhattacharya draws parallels between language models and user behavior modeling:

  • Titles/entities in Netflix's context are analogous to words in language
  • User engagement sequences resemble documents of text
  • The learning objective follows a self-supervised approach: predict the next item given historical context

However, significant differences exist:

  • Titles are dynamic, with new content added daily, unlike the relatively static vocabulary of language
  • User behavior is more heterogeneous and less structured than language
  • The signal-to-noise ratio in user interaction data is lower
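The self-supervised next-item objective described above can be sketched as prefix/target pair construction over a user's chronological history, analogous to next-token prediction in language models:

```python
# Sketch of the self-supervised objective: every prefix of a user's
# interaction sequence becomes a training example whose label is the next
# item engaged with. Title IDs here are placeholders.

def next_item_examples(interaction_sequence, min_context=1):
    """Yield (context, target) pairs from a chronological item sequence."""
    for i in range(min_context, len(interaction_sequence)):
        yield interaction_sequence[:i], interaction_sequence[i]

history = ["title_a", "title_b", "title_c", "title_d"]
pairs = list(next_item_examples(history))
# e.g. (["title_a"], "title_b"), (["title_a", "title_b"], "title_c"), ...
```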

Netflix's foundation model is a hierarchical multitask learning model built on a transformer architecture. It learns to predict:

  • The next item a user will engage with
  • User intent (e.g., movie vs. game, genre preferences)
  • Long-term and short-term preferences

The model processes user interaction history through hierarchical tokenization, rolling up repeated interactions within time windows to manage context length while preserving information content.
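A hedged sketch of the roll-up idea, assuming a fixed time window and a count-carrying token; the actual tokenization scheme was not disclosed:

```python
# Illustrative roll-up: consecutive interactions with the same title inside
# a time window collapse into one token carrying a count, so long histories
# fit a transformer's context length. Window size and token shape are
# assumptions, not Netflix's actual scheme.

def roll_up(events, window_s=3600):
    """events: chronological (timestamp, title_id) pairs -> rolled-up tokens."""
    tokens = []
    for ts, title in events:
        if tokens and tokens[-1]["title"] == title and ts - tokens[-1]["last_ts"] <= window_s:
            tokens[-1]["count"] += 1   # fold the repeat into the existing token
            tokens[-1]["last_ts"] = ts
        else:
            tokens.append({"title": title, "count": 1, "last_ts": ts})
    return tokens

events = [(0, "t1"), (600, "t1"), (1200, "t1"), (5000, "t2"), (9999, "t1")]
tokens = roll_up(events)  # three t1 plays collapse into one token
```

Three rapid plays of the same title become a single token with `count == 3`, shrinking the sequence while keeping the repeat information.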

The "Magic" of Foundation Models

The most compelling aspect of the presentation is how foundation models enable personalization across Netflix's ecosystem. By injecting foundation model understanding into UniCoRn, Netflix achieved:

  • 7% offline lift for search tasks
  • 10% offline lift for recommendation tasks
  • Significant online improvements (specific metrics not disclosed)

The foundation model acts as a "Harry Potter" that brings magical personalization capabilities to existing systems. Without requiring separate personalization models for each use case, the foundation model's understanding of user preferences enables immediate personalization across search, recommendation, and other ranking tasks.

Challenges and Considerations

Bhattacharya candidly addresses several challenges in implementing these systems:

Training and Compute Challenges:

  • Large foundation models require substantial computing resources
  • GPU optimization, data sharding, and efficient training algorithms are critical
  • Multi-GPU setups are necessary for both training and inference

Serving and Inference:

  • Balancing latency requirements with model complexity
  • Managing cost while maintaining performance
  • Implementing appropriate caching strategies
  • Ensuring robust evaluation both offline and online

Model Design Trade-offs:

  • Personalization vs. relevance: Over-personalization can harm search results when lexical relevance is important
  • Avoiding filter bubbles and concentration effects
  • Managing cold start problems for new content
  • Designing appropriate reward functions and objectives
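One way to make the personalization-relevance trade-off concrete is a tunable blend between a lexical relevance score and a personalization score. This is an illustrative device, not the mechanism described in the talk:

```python
# Illustrative blend of lexical relevance and personalization scores.
# A high alpha protects exact-match search quality; a low alpha lets
# personalization dominate on browse surfaces. Values are made up.

def blended_score(lexical_relevance, personalization, alpha):
    """alpha near 1.0 favors lexical relevance (explicit search queries);
    alpha near 0.0 favors personalization (browse/recommendation surfaces)."""
    return alpha * lexical_relevance + (1 - alpha) * personalization

search_score = blended_score(0.9, 0.2, alpha=0.8)
browse_score = blended_score(0.9, 0.2, alpha=0.2)
```

The same item scores very differently on the two surfaces, which is exactly the behavior an over-personalized search ranker fails to provide.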

System Integration:

  • Ensuring platform agnosticism across web, mobile, and other interfaces
  • Managing heterogeneous context inputs
  • Providing flexible APIs for different use cases

Key Takeaways

The presentation concludes with several important insights:

  1. Foundation models can holistically capture member preferences across different time horizons and contexts
  2. Unified models can improve both search and recommendation by learning shared representations while maintaining task-specific capabilities
  3. Personalization can be efficiently injected into existing systems through foundation model integration
  4. Infrastructure considerations are critical for production deployment at scale
  5. Trade-offs between personalization and relevance must be carefully managed

Technical Implementation Details

While specific implementation details are limited due to proprietary constraints, Bhattacharya provides insight into several technical approaches:

  • Feature engineering: Cross features and entity-specific features are crucial for model performance
  • Model architecture: Deep & Cross Network (DCN-V2) and transformer-based architectures are employed
  • Data processing: Hierarchical tokenization and roll-up strategies manage context length
  • Multitask learning: Joint training on multiple objectives improves generalization
  • Inference optimization: Separate model deployments for different SLA requirements
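For readers unfamiliar with DCN-V2, its core building block is a cross layer of the form x_next = x0 * (W·x + b) + x, where * is elementwise multiplication against the original input, which learns explicit feature crosses. The toy implementation below uses made-up weights purely for illustration:

```python
# Minimal sketch of a DCN-V2-style cross layer:
#   x_next = x0 * (W @ x + b) + x
# x0 is the layer-0 input, x the current layer's input; the elementwise
# product with x0 creates feature crosses, and "+ x" is a residual term.
# Weights and inputs are toy values for illustration.

def cross_layer(x0, x, W, b):
    # W @ x + b
    wx = [sum(W[i][j] * x[j] for j in range(len(x))) + b[i] for i in range(len(x0))]
    # elementwise x0 * (W @ x + b), plus residual x
    return [x0[i] * wx[i] + x[i] for i in range(len(x0))]

x0 = [1.0, 2.0]
W = [[0.5, 0.0], [0.0, 0.5]]
b = [0.1, 0.1]
out = cross_layer(x0, x0, W, b)  # first layer, where x = x0
```

Stacking a few such layers alongside a standard MLP gives the "deep & cross" combination the bullet above refers to.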

Future Directions

The presentation suggests several areas for future exploration:

  • Leveraging additional modalities beyond text and user history (e.g., video and image features)
  • Improving diversity and avoiding filter bubbles through exploration strategies
  • Enhancing editorial review processes while maintaining ML-driven personalization
  • Optimizing foundation model training for dynamic content environments

Conclusion

Bhattacharya's presentation provides a comprehensive overview of Netflix's journey from traditional ranking systems to unified contextual recommenders and foundation models. The key insight is that by building task-agnostic models that understand user preferences holistically, Netflix can deliver more personalized experiences while reducing system complexity and maintenance overhead.

The "magic" of foundation models lies in their ability to capture nuanced user preferences and enable personalization across multiple use cases without requiring separate models for each scenario. This approach represents a significant evolution in how large-scale recommendation systems can be designed and deployed.

For practitioners in the field, the presentation offers valuable lessons about the importance of considering both modeling innovations and system infrastructure challenges when building production-scale machine learning systems. The balance between personalization and relevance, the management of dynamic content, and the optimization of serving infrastructure are all critical considerations for successful deployment.

As the field of foundation models continues to evolve, Netflix's experience provides a practical roadmap for how these powerful models can be integrated into existing systems to deliver tangible improvements in user experience while managing the complexities of large-scale production environments.

The presentation slides provide additional technical details and visual explanations of the concepts discussed, including specific architecture diagrams and performance metrics that illustrate the effectiveness of the unified and foundation model approaches.
