Engineering Stable, Secure and Scalable Platforms: A Conversation with Matthew Liste

Platform engineering requires balancing stability, security, and scalability while managing limited resources and making difficult tradeoffs for end users and developer clients.

In this podcast, Michael Stiefel spoke to Matthew Liste about building and managing software platforms. Platform services act as the basis for application development, and must always be stable, secure, and scalable. Scaling these systems is particularly difficult because unknown resource contention often causes them to break. Using customer journeys, one can pinpoint the places where the system is particularly at risk. Platform engineering also requires managing limited resources, and making difficult tradeoffs about which functionality should be implemented. The discussion also highlighted how artificial intelligence can increase the speed of development and thus increase risk, and how it interferes with the development of junior engineers who typically learn from basic tasks that now can be done by artificial intelligence. Nonetheless, platform engineering is still responsible for maintaining the stability, security, and scalability of the platform.

Key Takeaways

Platform services, which are the basis for application development, must always be stable, secure, and scalable (the 3 Ss).
Scaling a system is particularly difficult, as unknown resource contention often breaks these systems when scaling.
Agentic AI does not change the nature of software development, but it does increase the speed of change. You still have to supervise the changes as before, and maintain the 3 Ss, but the degree of risk increases. Agents make mistakes faster than humans.
Observability and monitoring platforms must increase their speed, which means that they must also use agentic AI. This is similar to the use of AI to combat cyber threats generated by AI.
Using customer journeys is an effective means to measure system reliability and functionality. Customers here are both end-users and developers. An example of a customer journey is "Can I pay with my credit card?" Evaluating how a system failure affects a journey illustrates where the system is particularly at risk.
The use of artificial intelligence to do basic coding tasks interferes with the development of junior engineers who typically learn their craft from these tasks. This is an unsolved problem in the apprenticeship of new developers.
Platform engineering requires managing limited resources and making difficult tradeoffs for end users and developer clients. One critical issue is whether to be an early or late adopter of a technology, although the existence of open source software helps with this decision. It is necessary to have the discipline to reject narrow custom requests for single clients.

The Apprenticeship of Systems Engineering

Matthew Liste's journey into systems engineering began with a childhood fascination with computers. At eight years old, he played chess against a mainframe at the University of Oslo, an experience that sparked his lifelong passion for building and creating with technology.

His path mirrors what many experienced engineers describe - a journey of tinkering, experimentation, and gradual learning. "System engineering is in an apprenticeship no different than any other craft," Liste explains. "You get good at a craft by learning from others, from making mistakes and gradually understanding what great looks like."

This apprenticeship model faces a significant challenge in the age of AI. Junior engineers traditionally learn by doing basic tasks - the "stupid little things" that gradually become more complex. But when AI handles these foundational tasks, where do new engineers gain their experience?

"How do I become a senior developer if I never was a junior developer?" Liste asks. "Do we have a pipeline problem and do we end up not being able to have that person do that job?"

The Three S's: Stability, Security, and Scalability

The core responsibility of platform engineering is maintaining three non-negotiable principles: stability, security, and scalability. These aren't just technical requirements - they're business imperatives, especially in industries like financial services where system failures can cost millions of dollars.

Liste describes platforms as the foundation upon which applications are built. "I build platforms for other engineers that use them in turn to deliver business software to whatever they serve." This means platform engineers must think systemically, understanding not just their component but how it fits into the larger ecosystem.

Managing Risk and Learning from Failure

Risk management is central to systems engineering. Liste references Barry Boehm's spiral model of software development, which emphasizes evaluating risk at each stage and making decisions based on that assessment.

In practice, this means thinking about different environments: engineering candidates, development candidates, and production candidates. Each represents a different level of validation and confidence before something goes live.

Customer journeys provide a powerful framework for measuring system reliability. Instead of abstract metrics, teams focus on concrete user experiences: "Can I pay with my card?" or "Can I look at my statement?" These journeys reveal where systems are most vulnerable.
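One way to make the journey idea concrete is to record which services each journey depends on and then ask which journeys break when a given service fails. The sketch below is only an illustration of that idea, not a method described in the conversation: the journey names come from the examples above, while the service names (`auth`, `card-authorization`, `ledger`, `statements`) and both helper functions are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Journey:
    name: str
    required_services: frozenset[str]  # services the journey cannot work without


# Hypothetical journeys and service names, for illustration only.
JOURNEYS = [
    Journey("Can I pay with my card?",
            frozenset({"auth", "card-authorization", "ledger"})),
    Journey("Can I look at my statement?",
            frozenset({"auth", "statements", "ledger"})),
]


def broken_journeys(failed_services: set[str], journeys: list[Journey]) -> list[str]:
    """Return the names of journeys that depend on at least one failed service."""
    return [j.name for j in journeys if j.required_services & failed_services]


def blast_radius(journeys: list[Journey]) -> dict[str, int]:
    """For each service, count how many journeys break if it fails on its own."""
    counts: dict[str, int] = {}
    for journey in journeys:
        for service in journey.required_services:
            counts[service] = counts.get(service, 0) + 1
    return counts


if __name__ == "__main__":
    print(broken_journeys({"statements"}, JOURNEYS))
    print(blast_radius(JOURNEYS))
```

In this toy model, a failure of `statements` breaks one journey, while a failure of `auth` or `ledger` breaks both, which is the kind of concentration of risk that a journey-based view is meant to expose.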

The Challenge of Scaling

Scaling is where many systems fail, often in unexpected ways. "Scaling a system is particularly difficult, as unknown resource contention often breaks these systems when scaling," Liste notes. This could be network contention, CPU contention, memory contention, or dependencies downstream that aren't obvious until failure occurs.
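A rough way to reason about which resource gives out first is to compare, for each resource, how much of it one request consumes against how much is available per second. The resource with the lowest resulting ceiling saturates first. The sketch below assumes those per-request demands have already been measured, which is exactly the part that tends to be unknown in practice. All resource names and numbers are hypothetical.

```python
from __future__ import annotations

# Hypothetical per-request demand on each resource.
DEMAND = {
    "cpu_ms": 4,             # CPU milliseconds consumed per request
    "network_kb": 50,        # kilobytes sent per request
    "db_connection_ms": 20,  # milliseconds a downstream DB connection is held
}

# Hypothetical capacity available per second of wall-clock time.
CAPACITY = {
    "cpu_ms": 8_000,             # 8 cores x 1000 ms
    "network_kb": 125_000,       # roughly a 1 Gbit/s link
    "db_connection_ms": 20_000,  # pool of 20 connections x 1000 ms
}


def saturation_limits(demand: dict[str, float],
                      capacity: dict[str, float]) -> dict[str, float]:
    """Requests per second at which each resource would be fully used."""
    return {name: capacity[name] / used
            for name, used in demand.items() if used > 0}


def first_bottleneck(demand: dict[str, float],
                     capacity: dict[str, float]) -> tuple[str, float]:
    """Return the resource that saturates first and its request-rate ceiling."""
    limits = saturation_limits(demand, capacity)
    name = min(limits, key=limits.get)
    return name, limits[name]


if __name__ == "__main__":
    print(saturation_limits(DEMAND, CAPACITY))
    print(first_bottleneck(DEMAND, CAPACITY))
```

With these made-up numbers the CPU and network both have more headroom than the downstream connection pool, so the dependency that is least visible from the service itself is the one that limits scale.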

Anticipating scale is crucial. "If you want your product to be very successful, if your product is very successful, guess what? You're going to get more customers, you're going to drive up scale. And so you have to have built into your systems how they will deal with scaling."

Balancing Technical Perfection and Customer Experience

Platform engineering involves constant trade-offs between technical perfection and customer experience. You cannot have both perfect technical outcomes and perfect customer experiences simultaneously.

For example, with credit card authorization, systems could be incredibly fine-grained about every attribute, but this would lead to more declined charges and a worse customer experience. Instead, teams use heuristics and models to balance risk and usability.
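The difference between the two approaches can be sketched as a strict rule that declines on any risk signal versus a heuristic that adds up weighted signals and declines only above a threshold. This is an illustrative toy, not how any real authorization system described in the conversation works. The signal names, weights, and threshold are all assumptions.

```python
from __future__ import annotations

# Hypothetical risk signals with integer weights (points out of 100).
RISK_WEIGHTS = {
    "new_device": 30,
    "unusual_country": 50,
    "amount_above_usual": 20,
    "address_mismatch": 30,
}


def strict_decision(signals: set[str]) -> bool:
    """Fine-grained policy: approve only when no risk signal is present."""
    return not signals


def risk_score(signals: set[str]) -> int:
    """Sum the weights of the signals present; unknown signals count as zero."""
    return sum(RISK_WEIGHTS.get(signal, 0) for signal in signals)


def scored_decision(signals: set[str], threshold: int = 50) -> bool:
    """Heuristic policy: approve when the combined risk stays below the threshold."""
    return risk_score(signals) < threshold


if __name__ == "__main__":
    for signals in (set(), {"new_device"}, {"new_device", "unusual_country"}):
        print(sorted(signals), strict_decision(signals), scored_decision(signals))
```

A customer paying from a new phone is declined by the strict policy but approved by the scored one, while a new device combined with an unusual country is declined by both. Moving the threshold shifts the balance between declined charges and accepted risk.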

This extends beyond technical resilience to include process and people resilience. If online systems fail, customers can call a human representative - that's part of the system's overall resilience strategy.

Managing Developer Customers

Platform engineers serve two types of customers: end-users and developer clients. Managing developer expectations presents unique challenges, particularly around resource constraints and prioritization.

Liste describes a framework for decision-making: "What adds the most value to the most developers in the least amount of time that cost me the least to maintain?" This helps teams avoid being too early or too late with new technologies.
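The question in that quote can be turned into a rough ranking score: benefit is value multiplied by the number of developers reached, and cost is build time plus maintenance over some horizon. The formula, field names, three-year horizon, and example requests below are assumptions made for illustration. The conversation does not give a formula.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformRequest:
    name: str
    value_per_developer: float         # subjective value score per developer
    developers_reached: int            # how many developers would benefit
    build_weeks: float                 # time to deliver
    maintenance_weeks_per_year: float  # ongoing cost of ownership


def priority(request: PlatformRequest, horizon_years: float = 3.0) -> float:
    """Benefit divided by total cost of ownership over the horizon."""
    benefit = request.value_per_developer * request.developers_reached
    cost = request.build_weeks + request.maintenance_weeks_per_year * horizon_years
    if cost <= 0:
        raise ValueError("total cost must be positive")
    return benefit / cost


def rank(requests: list[PlatformRequest],
         horizon_years: float = 3.0) -> list[PlatformRequest]:
    """Order requests from highest to lowest priority."""
    return sorted(requests, key=lambda r: priority(r, horizon_years), reverse=True)


if __name__ == "__main__":
    backlog = [
        PlatformRequest("custom connector for one client", 5.0, 8, 4.0, 3.0),
        PlatformRequest("shared build template", 2.0, 400, 6.0, 2.0),
    ]
    for request in rank(backlog):
        print(f"{request.name}: {priority(request):.2f}")
```

Under these assumptions the narrow single-client request scores far below the shared capability, even though its value per developer is higher, because few developers benefit and the maintenance cost keeps accruing.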

He uses an analogy of a puppy versus a dog: "You can fall in love with a puppy, but are you ready to care and feed for it for years and walk it and do all things? And if the answer's no, well, then it's probably not time to bring it into the house."

The Role of Culture

Culture is perhaps the most important aspect of platform engineering. "Great culture builds great teams, and great teams build great products," Liste emphasizes. This culture must balance autonomy with safety, allowing teams to make decisions while preventing catastrophic mistakes.

One innovative approach is the "Developer Zero" concept - a team that consumes platforms just like any other developer, providing feedback on documentation, APIs, and user experience without insider knowledge.

The Impact of Agentic AI

Agentic AI changes the speed of operations but not the fundamental nature of systems engineering. "I don't view agentic any different from a human or cultural perspective than a senior developer overseeing junior developers," Liste explains.

What changes is velocity. AI can make mistakes faster than humans, which means observability and monitoring systems must also operate at AI speeds. This creates an arms race similar to cybersecurity, where defensive AI must match offensive AI capabilities.

The Invisible Nature of Good Engineering

Perhaps the most challenging aspect of platform engineering is that good work is invisible. "Your job is to be invisible. And if you do a really good job, no one ever knows you exist," Liste notes.

This invisibility can be frustrating when stakeholders don't appreciate the complexity involved or make unrealistic demands. "They trivialize the complexity that goes into building a system and they over trivialize it and then say, 'Well, that cannot be that difficult.'"

Conclusion

Platform engineering remains a critical discipline in the age of AI and rapid technological change. While tools and technologies evolve, the fundamental principles of stability, security, and scalability endure. The challenge lies in maintaining these principles while adapting to new paradigms and managing the expectations of both end-users and developer customers.

The future of platform engineering will likely involve more AI-assisted development and operations, but the human elements - culture, risk management, and the art of balancing competing priorities - will remain essential. As Liste's experience demonstrates, successful platform engineering is as much about people and processes as it is about technology.
