Booking.com's 20-year journey from Perl and MySQL to AI-powered platforms reveals the messy reality of scaling machine learning infrastructure and the importance of unified command centers.
At QCon London 2026, Jabez Eliezer Manuel, Senior Principal Engineer at Booking.com, presented "Behind Booking.com's AI Evolution: The Unpolished Story," offering a candid look at the company's 20-year technological transformation. The presentation traced Booking.com's journey from its early days in 2005 to its current AI-powered platforms, highlighting the challenges, failures, and breakthroughs that shaped their data-driven DNA.
The 2005 Starting Point
Manuel began by setting the context of 2005, when the Motorola Razr V3 dominated mobile phones, Web 2.0 was emerging, and Booking.com was nine years old. The company had just launched its first A/B tests, an experimentation program that Manuel put at more than 1,000 experiments running in parallel and 150,000 experiments in total. Despite a success rate of less than 25%, Manuel emphasized that "the goal wasn't to be right; it was to learn fast." These early experiments built the foundation of Booking.com's data-driven culture.
Data Management: From Single Database to 6800 Instances
The company's original tech stack relied on Perl libraries and MySQL, featuring asynchronous replication and commercial support. In 2005, they operated with a single master database. By 2020, this had grown to approximately 6800 database instances. Their MySQL setup was distinctive: it eschewed specialized hardware, stored procedures, user-defined functions (UDFs), database views, and cache layers.
Their "secret sauce," as Manuel described it, involved keeping databases small, with a 2TB limit so that each fit on NVMe solid-state drives, which achieved point queries in under 350 microseconds. This model worked until data volumes became unmanageable.
The Hadoop Era and Its Sunset
To address scaling challenges, Booking.com adopted Apache Hadoop around 2011, deploying two on-premise clusters with approximately 60,000 cores and 200 PB of hard disk space. Hadoop powered their machine learning pipeline for years until systemic issues emerged. From a machine learning scientist's perspective, problems included "noisy neighbors" where one bad query could clog an entire cluster, lack of GPU support, and capacity issues causing overloads and outages during peak times.
By 2018, the decision was made to sunset Hadoop, but the migration process took approximately seven years. The five-phase strategy included mapping their entire ecosystem, analyzing usage to reduce scope, applying Google's PageRank algorithm, migrating in waves, and finally phasing out Hadoop. Manuel identified a unified command center as the key to their successful migration.
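The article does not say what PageRank was applied to during the migration. One plausible reading is that it was run over the dependency graph of datasets and jobs to find the assets that most other things rely on. The sketch below illustrates that idea only; the graph, the node names, and the edge direction ("A reads from B") are hypothetical and are not taken from the talk.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> list of nodes it points to."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


# Hypothetical dependency graph: an edge A -> B means "A reads from B".
dependencies = {
    "report_job": ["bookings_table", "reviews_table"],
    "ranking_features": ["bookings_table"],
    "ml_training_job": ["ranking_features", "bookings_table"],
    "bookings_table": [],
    "reviews_table": [],
}

scores = pagerank(dependencies)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Under this reading, a high score marks an asset that many consumers depend on, directly or transitively, which would be one way to decide what to migrate first or what to protect during each wave.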
Machine Learning Engineering Evolution
Booking.com's machine learning stack evolved from Perl libraries and MySQL in 2005 to agentic systems by 2025. The journey included Apache Oozie with Python, Apache Spark with MLlib, H2O.ai, deep learning, and generative AI.
Manuel highlighted 2015 as a pivotal year when Booking.com solved two core problems: real-time predictions using online inference at scale, and feature engineering for training and inference. As of 2024, their machine learning inference platform serves over 480 models, processes 400 billion predictions per day, and maintains latency under 20 milliseconds.
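To put the daily figure in perspective, it can be converted into an average per-second rate. The short calculation below uses only the numbers quoted above; it assumes traffic is spread evenly across the day and across models, which real traffic is not, so peak load would be higher than these averages.

```python
PREDICTIONS_PER_DAY = 400_000_000_000  # figure quoted in the talk
MODELS = 480                           # "over 480", so treated as a lower bound
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate, assuming traffic is spread evenly over the day (a simplification).
avg_per_second = PREDICTIONS_PER_DAY / SECONDS_PER_DAY

# Mean rate per model; an upper bound on the mean, since there are more than 480 models.
avg_per_model_per_second = avg_per_second / MODELS

print(f"{avg_per_second:,.0f} predictions per second on average")
print(f"{avg_per_model_per_second:,.0f} predictions per second per model at most, on average")
```

That works out to roughly 4.6 million predictions per second on average across the platform.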
Domain Intelligence: Four Specialized Platforms
Manuel discussed four domain-specific machine learning platforms: GenAI, Content Intelligence, Recommendations, and Ranking.
GenAI applications include trip planning, smart filters, and review summaries. Content Intelligence serves as a machine learning content hub for image and review analyses, plus text generation, with use cases like detailed hotel content. Recommendations personalizes the content displayed to customers.
The fourth platform, Ranking, presented the most complex challenge. Booking.com faced a three-way optimization problem involving choice and value, exposure and growth, and efficiency and revenue. Their 2005 ranking formula was a simple function incorporating bookings, views, and a random number. Attempts to evolve the formula with factors like cancellations, distance-based ranking, room availability, and hotel impressions proved difficult.
When they tried to replace the ranking formula with machine learning, they found the existing formula "undefeatable" because of infrastructure limitations. Their experiments typically ran for two to four weeks, but they needed faster iteration. They adapted their A/B testing to include interleaving, where 50% of each experiment set was interwoven into a single test, allowing more variants to be compared with less traffic. This led to a strategy of preselecting candidates with interleaving and validating the winners with A/B testing.
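The talk summary does not say which interleaving method Booking.com uses. As an illustration of the general idea, the sketch below implements team-draft interleaving, a common method from the search-ranking literature: two rankers take turns contributing results to one list shown to a single user, and clicks are credited to whichever ranker supplied the clicked item. The function names and hotel identifiers are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Merge two rankings into one list and record which ranker supplied each item."""
    rng = rng or random.Random()
    interleaved, team_of = [], {}
    count_a = count_b = 0
    total_items = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < min(length, total_items):
        # The ranker with fewer picks goes next; ties are broken by a coin flip.
        a_turn = count_a < count_b or (count_a == count_b and rng.random() < 0.5)
        source = ranking_a if a_turn else ranking_b
        pick = next((item for item in source if item not in team_of), None)
        if pick is None:
            # This ranker has nothing unused left, so the other one picks instead.
            a_turn = not a_turn
            source = ranking_a if a_turn else ranking_b
            pick = next(item for item in source if item not in team_of)
        team_of[pick] = "A" if a_turn else "B"
        interleaved.append(pick)
        if a_turn:
            count_a += 1
        else:
            count_b += 1
    return interleaved, team_of


def credit_clicks(clicked_items, team_of):
    """Count clicks per ranker; the ranker with more credited clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team_of:
            wins[team_of[item]] += 1
    return wins


ranking_a = ["hotel_1", "hotel_2", "hotel_3", "hotel_4"]  # e.g. the existing formula
ranking_b = ["hotel_3", "hotel_1", "hotel_5", "hotel_2"]  # e.g. a candidate model
interleaved, team_of = team_draft_interleave(ranking_a, ranking_b, length=4, rng=random.Random(7))
print(interleaved, credit_clicks(["hotel_3"], team_of))
```

Because every user sees results from both rankers in the same list, each impression yields a direct comparison. This is why interleaving can typically separate candidates with less traffic than a split A/B test, and why it suits a preselection step before a full A/B validation.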
Orchestration and the Future
The presentation concluded with how the domain-specific platforms are now unified through an orchestration layer, representing the culmination of Booking.com's AI evolution. Manuel's "unpolished story" revealed that technological transformation is rarely linear or smooth: it involves false starts, long migrations, and continuous learning from failures.
The journey from a single MySQL database to 6800 instances, from Hadoop to modern ML platforms, and from simple ranking formulas to sophisticated AI systems demonstrates that building AI capabilities at scale requires not just technical expertise but also organizational resilience and a willingness to embrace imperfection in pursuit of progress.
