Skyscanner's journey to transform observability across 800+ microservices using OpenTelemetry demonstrates how decoupling instrumentation from vendors and treating platforms as products can reduce technical debt and empower engineers.

Building a Future-Proof Observability Platform: From Siloed Metrics to Context-Rich Insights

In today's complex distributed systems, observability has become a critical capability for maintaining service reliability and performance. Skyscanner's experience in re-architecting its observability platform offers valuable insights for organizations navigating similar challenges. Wayne Bell, former Director of Platforms at Skyscanner, and Dan Gomez Blanco, Principal Observability Architect at New Relic, shared their journey at InfoQ Dev Summit Munich, explaining how moving to OpenTelemetry decoupled instrumentation from vendors and helped reduce repeat incidents across an ecosystem of more than 800 microservices.

The Evolution of Observability: From Pillars to Correlated Signals

Traditional observability approaches often treat metrics, traces, and logs as separate pillars, creating silos between different signals. This fragmentation makes it challenging to correlate issues across system components. As Gomez Blanco explained, "If you treat them as pillars, they will generate silos. You understand that you've got upstream and downstream dependencies, you've got a complex distributed system, but you still operate in a way that your service is at the center of it. You forget about the holistic view of a system."

The real-world impact of these silos becomes apparent during incidents. When Skyscanner's website experienced issues affecting travelers' ability to book flights, teams struggled to connect the dots between different telemetry sources. This scenario highlighted the critical need for context-rich observability that can correlate signals across the entire system.

OpenTelemetry emerges as a solution to these challenges by enabling correlated signals rather than isolated pillars. The framework provides standardized ways to capture telemetry data while allowing flexibility in backends and processing. As Gomez Blanco noted, "We're moving away from pillars, and we're going to correlated signals."

OpenTelemetry: Decoupling Cross-Cutting Concerns

At the core of Skyscanner's observability transformation is OpenTelemetry's architectural approach, which decouples the API from the SDK implementation. This separation addresses the fundamental challenge of cross-cutting concerns in telemetry collection.

"Telemetry APIs are like this," Gomez Blanco explained. "Building an API like that, that is a cross-cutting concern, needs to be done very carefully to ensure that it doesn't contain breaking changes, to ensure that it doesn't leak implementation details. That's at the core of OpenTelemetry's API design."

The API/SDK decoupling provides several advantages:

Stability: The API remains stable while implementations can evolve independently
Flexibility: Teams can choose appropriate backends for their specific needs
Standardization: Consistent interfaces reduce cognitive load for developers
Future-proofing: New capabilities can be added without breaking existing integrations

This approach has gained significant traction across the industry. Major projects like Deno, Quarkus, Azure SDK, Elasticsearch client, Kubernetes, Envoy, Istio, and gRPC have all adopted OpenTelemetry natively. "You're using that library, and then you'll basically be able to do whatever you want with it," Gomez Blanco noted.
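The API/SDK split can be illustrated with a small sketch of the pattern. This is not OpenTelemetry's real API: the names NoOpTracer, RecordingTracer, set_tracer, get_tracer, and fetch_prices are hypothetical, chosen only to show how library code can depend on a stable, do-nothing API while the application decides which implementation to plug in.

```python
# --- "API" layer: stable, tiny, and a no-op by default ---------------------

class NoOpSpan:
    """Default span that records nothing."""

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def set_attribute(self, key, value):
        pass


class NoOpTracer:
    def start_span(self, name):
        return NoOpSpan()


_tracer = NoOpTracer()


def set_tracer(tracer):
    """Called once by the application to plug in an implementation."""
    global _tracer
    _tracer = tracer


def get_tracer():
    return _tracer


# --- Library code: depends on the API layer only ---------------------------

def fetch_prices(route):
    with get_tracer().start_span("fetch_prices") as span:
        span.set_attribute("route", route)
        return {"route": route, "price": 120}


# --- "SDK" layer: one possible implementation, chosen by the application ---

class RecordingSpan:
    def __init__(self, name, sink):
        self.name = name
        self.attributes = {}
        self._sink = sink

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Hand the finished span to the sink when the block exits.
        self._sink.append({"name": self.name, "attributes": self.attributes})
        return False

    def set_attribute(self, key, value):
        self.attributes[key] = value


class RecordingTracer:
    def __init__(self):
        self.finished = []

    def start_span(self, name):
        return RecordingSpan(name, self.finished)


before = fetch_prices("EDI-MUC")   # no implementation installed: still works
sdk = RecordingTracer()
set_tracer(sdk)
after = fetch_prices("EDI-MUC")    # same library code, now emitting telemetry
print(sdk.finished)
```

The library function never changes between the two calls. Only the application's choice of implementation does, which is the property that lets instrumented libraries ship without committing their users to a backend.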

Platform Boundaries: Shifting from Infrastructure to Application

A critical decision point in Skyscanner's observability journey was defining their platform boundaries. Platforms are traditionally viewed purely as infrastructure: collectors, agents, and pipelines. However, drawing the boundary there leads to inconsistency in how teams implement observability practices.

"If you run your own observability platform, you'll have your ingest APIs and your APIs to extract that telemetry, to present it in dashboards, alerts, and so on. Then you leave the rest to your users," Gomez Blanco explained. "You're the platform. You have the infrastructure. I'm not going to tell you how to use it. Then what happens is that, yes, you do get autonomy, but you also get inconsistency."

Skyscanner chose a different approach, extending their platform boundary to include application-level configuration. This shift allows teams to maintain consistency while preserving autonomy:

Shared configuration files that define minimal viable telemetry standards
Base Docker images with pre-configured observability components
Internal libraries that bake best practices into the development workflow
Reusable modules for common alerting patterns and dashboards

"When you want to roll out a change, you just can roll it out as a version bump of an internal library," Gomez Blanco emphasized. "It's less of, if you build it, they will come."
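As a rough sketch of what an internal library that bakes in defaults might look like, the function below merges platform-wide settings with per-service overrides. The environment variable names are standard OpenTelemetry SDK settings, but everything else is an assumption for illustration: the function name, the version string, the collector endpoint, the sampling ratio, and the team.name and platform.telemetry.version resource attributes. Nothing here is Skyscanner's actual library.

```python
# Hypothetical version of the internal telemetry library. Bumping it is how
# the platform team would roll a new default out to every service.
PLATFORM_TELEMETRY_VERSION = "1.4.0"

# Platform-owned defaults (values are illustrative assumptions).
PLATFORM_DEFAULTS = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector.internal:4317",
    "OTEL_TRACES_SAMPLER": "parentbased_traceidratio",
    "OTEL_TRACES_SAMPLER_ARG": "0.1",
}


def build_telemetry_env(service_name, team, overrides=None):
    """Return the environment settings a service should start with.

    Teams may override the defaults, but the identifying fields are always
    set by the platform so telemetry stays consistent across services.
    """
    if not service_name or not team:
        raise ValueError("service_name and team are required")

    env = dict(PLATFORM_DEFAULTS)
    env.update(overrides or {})

    # Mandatory fields are applied last so overrides cannot remove them.
    env["OTEL_SERVICE_NAME"] = service_name
    env["OTEL_RESOURCE_ATTRIBUTES"] = (
        f"team.name={team},"
        f"platform.telemetry.version={PLATFORM_TELEMETRY_VERSION}"
    )
    return env


env = build_telemetry_env(
    "flight-search",
    "search-squad",
    overrides={"OTEL_TRACES_SAMPLER_ARG": "0.5"},
)
for key in sorted(env):
    print(f"{key}={env[key]}")
```

In this arrangement a team keeps autonomy over tunable settings such as the sampling ratio, while the platform keeps control of the fields that make telemetry comparable across services.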

Cultural Transformation: Platform as Product

While technical decisions formed the foundation of Skyscanner's observability transformation, the cultural shift proved equally critical. As Bell noted, "Tooling, though, and software tech I think is the easiest part. My experience as a principal engineer has taught me that tooling is the easiest part, and the difficult part is culture."

The initial approach of mandating observability standards faced resistance from engineering teams who viewed the platform as taking away their autonomy. "They'll run systems that you are then giving them, but they'll also keep this one," Bell explained. "The reason they do that is not because they're trying to be bad. It's purely that they understand this one. This one is the system that they have. This one is a new one. I don't know that one. I'm busy."

The breakthrough came from reframing the platform not as infrastructure, but as a product with engineers as customers. "What if we think about platform as a product?" Bell proposed. "The platform in Skyscanner, which serves to allow builds to happen, deploy, host, route, and observe, and so much more. The list is endless of what it provides. It is a product."

This product mindset led to several key changes:

Starting with why: Connecting observability practices to business outcomes for travelers
Shared outcomes: Collaborating with product teams to define meaningful metrics
Co-creation: Involving engineering teams in defining standards rather than imposing them
Show, don't tell: Demonstrating value through real-world problem solving

"We actually have open conversations with the engineering teams. We get the engineering teams to help us define standards. We write it together," Bell explained. "Some people might think, how does that scale, 860-odd people? There's ways of doing that. You'll have a look at who your biggest customers are in the organization and start to work with them."

Implementation Strategy: From Champions to Advocates

Skyscanner's observability rollout followed a strategic adoption path that transformed skeptics into champions:

Early adopters: Teams with specific pain points who were willing to try new approaches
Observability champions: Formal program that identified and trained advocates across the organization
Organic advocates: Team members who became champions through positive experience
General adoption: Widespread implementation as the new standard

"That's where the scalability starts coming in with 860-plus or 1,000, whatever, people that you're trying to get the message out," Bell noted. "It's coming from within their team."

A critical insight from this process was the importance of "showing, not telling." When an actual incident occurred, the observability team demonstrated its value by diagnosing issues from the "pixel level" up through the entire stack. "Dan showed he cared, because he wants to be there to enable the engineers to serve the traveler," Bell explained. "We weren't. There was an issue."

Measuring Success: Quantifying the Impact

The observability transformation delivered measurable results that justified the investment:

20% reduction in repeat incidents: By identifying root causes more effectively
40% reduction in duplicated effort: Through standardized tooling and practices
Accelerated adoption: The pattern was reused for other platform initiatives
Improved SLO ownership: Product teams now own SLOs tied to traveler outcomes

"The key thing in there is repeat incidents," Bell emphasized. "The thing that was hurting us quite a bit, we ended up in a cycle of the same incident happening and then having learning, resolving that, but we weren't getting to the root cause. By rolling out observability, we were able to get to that root cause much more quickly and eradicate that from ever happening again."

Future Directions: Extending Observability to New Domains

Looking ahead, Skyscanner's observability platform continues to evolve with emerging technologies:

GenAI integration: Monitoring model drift and AI system performance
Data observability: Bridging the gap between online and offline worlds
Continuous profiling: Deeper insights into application performance
Enhanced semantic conventions: Standardizing descriptions of complex systems

"This project, in particular, landed at the right time for us," Bell noted. "Because if you look at the date, it's 2020. Something happened around 2021, 2022, starts with Gen and AI. Basically, in there, it's given us the foundation to be able to start to think about model drift in LLM."

For organizations considering similar transformations, the key takeaways are clear:

Your platform is your product: Engineers are your customers, not just users
Open standards matter: Decoupling APIs from implementations provides flexibility and stability
Culture drives adoption: Technical solutions alone cannot overcome cultural resistance
Show value through action: Demonstrate benefits with real-world problem solving
Measure and iterate: Establish clear metrics to track progress and identify areas for improvement

As organizations continue to navigate increasingly complex distributed systems, Skyscanner's observability journey offers a blueprint for building platforms that not only provide technical capabilities but also empower engineers to deliver greater value to their end users.

#Observability #OpenTelemetry #Microservices #platform #Culture
