Tiger Teams, Evals and Agents: The New AI Engineering Playbook

As AI engineering rapidly evolves, organizations are adopting new approaches, including cross-functional Tiger Teams, specialized evaluations, and agentic applications. This article explores the emerging discipline that is transforming how we build, test, and deploy AI systems.

The field of AI engineering is experiencing unprecedented growth, evolving at three to four times the speed of previous technology waves like DevOps and data engineering. As organizations race to implement AI solutions, they're discovering that traditional software development approaches aren't sufficient for the unique challenges of agentic applications. In a recent conversation with Sam Bhagwat, co-founder and CEO of Mastra, we explored the emerging discipline of AI engineering and the organizational structures needed to successfully ship AI-powered systems.
The Evolution of Open Source in the AI Era
Open source communities have always followed a predictable pattern: they begin with enthusiastic tinkerers who experiment with new technologies, then evolve to include production users who rely on these tools in critical business contexts. This evolution requires maintainers to let go of opinionated approaches and adapt to broader user needs.
"Open source communities evolve over time," Bhagwat explains. "In the beginning, it's a lot of people tinkering around with things. If your project gains traction, people start bringing it into their work environments. The first people you get that don't like your thing are usually the people who inherited a project that someone else built with your thing."
For commercial open source companies, this evolution presents a unique challenge: finding people who balance open source generosity with commercial pragmatism. "There are people in open source that are open source purists and have a very difficult time working in a company that has any sort of commercial mission," Bhagwat notes. "There are also commercial type people that are just very 'I win, you lose' kind of people. These people have a hard time working in open source type companies because there's a certain magnanimity where, 'no, we don't want to charge for this.'"
The sweet spot, according to Bhagwat, lies with individuals who embrace both the open source ethos and commercial realities. "You have to find the open source people, but people that aren't too anti-commercial. And you have to find the commercial people who are savvy, but they're like 'you win, I win' kind of people rather than too much in the other direction."
The Emerging Discipline of AI Engineering
AI engineering represents a new discipline that bridges traditional software development with data science. Unlike previous emerging fields, AI engineering is developing at an accelerated pace. "Everything is happening faster this time," Bhagwat observes. "The metric I look at the most is how fast growth is happening in AI projects versus previous kinds of projects. What used to be three or four months of project growth is now happening in one month."
This rapid adoption creates both opportunities and challenges. For professionals looking to transition into AI engineering, the field offers a unique advantage: there's a significant unmet need for experienced practitioners. "There's a period of time where, if folks want to transition in, it's kind of easier, because there's this very unmet need — companies want to build these types of applications and do this kind of engineering, but there aren't that many people who have three years of experience," Bhagwat explains.
AI-augmented development is transforming how open source projects are maintained. Bhagwat's team at Mastra has developed specialized agents for various aspects of the development lifecycle:
- Agents that convert bug reports from Discord into GitHub issues
- Agents that generate reproduction cases from incomplete bug reports
- Agents that automatically create changelogs
- Third-party agents that comment on pull requests and judge code quality
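The first of those pipeline agents can be sketched in a few lines. The function below drafts a GitHub issue from a raw Discord message; the field extraction is a trivial heuristic standing in for the LLM call a real agent would make, and the issue template and names are illustrative, not Mastra's actual implementation.

```typescript
// Sketch: turning a raw Discord bug report into a structured GitHub issue.
// In a real pipeline an LLM would extract and normalize these fields and the
// result would be POSTed to the GitHub API; here the shape is what matters.

interface BugIssue {
  title: string;
  body: string;
  labels: string[];
}

function draftIssue(discordMessage: string, reporter: string): BugIssue {
  // Use the first line of the report as a short title.
  const firstLine = discordMessage.split("\n")[0].slice(0, 72);
  return {
    title: `[bug] ${firstLine}`,
    body: [
      `Reported by ${reporter} on Discord.`,
      "",
      "## Original report",
      discordMessage,
      "",
      "## Reproduction",
      "_Agent to fill in: minimal reproduction case._",
    ].join("\n"),
    labels: ["bug", "from-discord"],
  };
}
```

The same pattern — normalize unstructured input into the schema a downstream system expects — covers the changelog and reproduction-case agents as well.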
"We're just heavy Claude Code users, and internally we're using Composer to run multiple agents at the same time," Bhagwat shares. "My co-founder was thinking about getting a new computer so we can run more parallel coding agents. But we've also built agents for every step of that. You just feel like you've put on this superpower suit and you can get more done."
Evaluations: Ensuring Quality in Non-Deterministic Systems
One of the fundamental challenges in AI engineering is the non-deterministic nature of large language models. Unlike traditional software where identical inputs produce identical outputs, AI systems can produce different valid responses to the same prompt. This requires a new approach to testing and validation.
"We've seen them be, let's say, 10X as important in AI engineering as in normal engineering, because of the non-determinism of agentic applications. You could have multiple successes that have different response bodies, and that's not the case when you're building traditional software applications," Bhagwat explains.
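Because identical inputs can yield differently worded but equally correct outputs, an eval scores properties of a response rather than comparing it byte-for-byte against a golden answer. A minimal sketch, using required-fact coverage as the simplest stand-in for the LLM judges or embedding similarity that production evals often use:

```typescript
// Sketch: a minimal eval for non-deterministic outputs. Instead of exact
// string equality, score each response on whether it contains the facts the
// correct answer must include.

function factCoverage(response: string, requiredFacts: string[]): number {
  const text = response.toLowerCase();
  const hits = requiredFacts.filter((f) => text.includes(f.toLowerCase()));
  return hits.length / requiredFacts.length; // 1.0 = every fact present
}

// Two differently worded responses can both pass the same eval:
const facts = ["14 days", "manager approval"];
const a = "PTO requests need manager approval and must be filed 14 days ahead.";
const b = "File at least 14 days in advance; manager approval is required.";
```

Both `a` and `b` score 1.0 here even though their response bodies differ, which is exactly the situation exact-match testing cannot handle.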
Evaluations (evals) in AI engineering take several forms:
Generic evals: These include prompt accuracy, fairness, bias and toxicity detection, and tool-calling accuracy. They are available as off-the-shelf solutions but provide limited value for specific use cases.
Domain-specific evals: These are custom evaluations based on an organization's unique data, domain expertise, and business requirements. "Where you really start getting into high amounts of value for your particular use case is when you are able to write evals that are unique to your business based on data that your organization has that others don't have," Bhagwat emphasizes.
Creating effective domain-specific evals involves a systematic process:
Gather domain expertise: Bring in subject matter experts to provide comprehensive questions and expected answers.
Collect human-created data: Build datasets that include questions, relevant documents, and sample inputs with correct outputs.
Assess baseline accuracy: Determine the current accuracy of your agent against the dataset.
Identify failure modes: Analyze where the agent struggles and why.
Iteratively improve: Refine prompts, context, and system design to address specific failure patterns.
"Typically, these projects have two phases," Bhagwat describes. "The first phase is: can we get a prototype working that you can chat with and that will give you answers? And then around there is where you start assessing the accuracy of the agent. Okay, so this agent has 80% accuracy or 85% accuracy. We need it to be 95% accurate or 99% accurate, or however you want to score it."
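The baseline-accuracy step above reduces to a small harness: run the agent over the human-created dataset and count passes. In this sketch the agent is a synchronous stand-in (real agent calls are async) and the grader is exact-match, where a real harness would plug in a domain-specific eval.

```typescript
// Sketch: assessing baseline accuracy of an agent against a dataset of
// questions with known-correct answers. The numbers this produces are the
// "80% accuracy, we need 95%" figures that drive the iteration loop.

interface Example {
  question: string;
  expected: string;
}

function baselineAccuracy(
  agent: (question: string) => string, // stand-in; real agents are async
  dataset: Example[],
): number {
  let correct = 0;
  for (const ex of dataset) {
    // Exact match for simplicity; swap in a fact-coverage or LLM-judge
    // grader for non-deterministic outputs.
    if (agent(ex.question).trim() === ex.expected.trim()) correct++;
  }
  return correct / dataset.length; // e.g. 0.85 -> "85% accurate"
}
```

Rerunning the same harness after each prompt or context change is what turns "identify failure modes" and "iteratively improve" into a measurable loop.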

Building Agentic Applications
Agentic applications represent a new paradigm in software development, where AI systems act as autonomous agents capable of understanding context, making decisions, and taking actions on behalf of users. The most common use case Bhagwat is seeing involves building agents as interfaces within existing SaaS applications.
"In some ways, the web is a client of my SaaS app. But maybe I have a mobile client as well, or multiple mobile clients across iOS and Android, and maybe desktop too. And in some ways, your agent is another client for your APIs," Bhagwat explains.
A typical example involves an HR SaaS platform that observed users exporting CSV data and pasting it into ChatGPT to answer questions. This presented two problems: privacy concerns and the lack of organizational context in general-purpose chat tools. The solution was to build an agent within the SaaS application that could:
- Generate reports based on organizational data
- Answer HR policy questions by merging salary and policy documents
- Provide responses with appropriate organizational context
"This is the modal use case of people building agents," Bhagwat notes. "And something that's really interesting is these sort of customer-facing agents that have access to organizational data and can interact with users to surface information that maybe isn't clear, obvious, or easy to get using the basic functionality that exists in the SaaS app."
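The "agent as another client of your APIs" idea usually takes the form of tool definitions: each tool wraps an existing SaaS endpoint, and the model chooses which to call. The sketch below is framework-agnostic; the tool names, schema, and dispatcher are illustrative, not a specific framework's API, and the `run` bodies stand in for real API calls.

```typescript
// Sketch: exposing existing SaaS API capabilities to an agent as tools, so
// the agent becomes another client of the same APIs the web and mobile
// clients already use.

interface Tool {
  name: string;
  description: string; // shown to the LLM so it can pick the right tool
  run: (args: Record<string, string>) => string;
}

const tools: Tool[] = [
  {
    name: "generate_report",
    description: "Generate a headcount or salary report for a department.",
    run: ({ department }) => `report for ${department}`, // would call the reports API
  },
  {
    name: "lookup_policy",
    description: "Answer a question from internal HR policy documents.",
    run: ({ topic }) => `policy excerpt about ${topic}`, // would call document search
  },
];

// Minimal dispatcher: the LLM emits a tool name plus arguments, and the
// application executes the matching tool inside its own auth boundary.
function dispatch(toolName: string, args: Record<string, string>): string {
  const tool = tools.find((t) => t.name === toolName);
  if (!tool) throw new Error(`unknown tool: ${toolName}`);
  return tool.run(args);
}
```

Because the tools run inside the SaaS application, the agent inherits the app's authentication and data boundaries, which addresses the privacy problem of pasting exported CSVs into a general-purpose chat tool.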
The Tiger Team Approach for AI Projects
Shipping AI agents to production requires marrying software engineering rigor with data science's comfort with statistical uncertainty. This hybrid approach has led to the emergence of cross-functional Tiger Teams specifically designed for AI projects.
"What we've seen teams have success with is being able to find folks to work on a project that are able to gather information from different types of people," Bhagwat explains. "So I think there's a few different team archetypes here."
Common Tiger Team configurations include:
CTO-led teams: For high-risk, high-value projects, the CTO often acts as project lead and writes significant portions of the code.
Prototype-to-production handoffs: Initial prototyping may be done by one team, with production development handed off to a specialized Tiger Team with different skill sets.
The key characteristic of these teams is their cross-functional nature, which doesn't map neatly into traditional organizational structures. "This Tiger Team concept though, I think is very important because you do need to pull in people cross-functionally, and it is not going to map into your existing org structure," Bhagwat emphasizes. "Organizations that we see struggling with this are the ones that are more command and control. They have a harder time making Tiger Teams for specific projects that are cross-functional."

Practical Implementation: From Prototype to Production
Moving an AI agent from prototype to production requires a carefully staged approach that balances rapid iteration with rigorous validation. Bhagwat outlines a typical progression:
Prototype phase: Build a working version that can answer basic questions and demonstrate functionality.
Accuracy assessment: Evaluate the prototype against a comprehensive dataset to establish baseline performance.
Iterative improvement: Systematically address failure modes by refining prompts, expanding context windows, and adjusting system design.
Confidence building: Continue testing until the agent meets the organization's accuracy threshold for its specific use case.
Staged rollout: Use feature flags to gradually expose the agent to increasing percentages of users, starting with internal beta testers and expanding to broader audiences.
"These typically don't roll out over days. It might be over weeks, as you gain confidence and roll it out to wider groups of people," Bhagwat notes. "You have to understand what the risk is for your organization of giving incorrect answers — sometimes that's higher and sometimes that's lower — and so you may have different thresholds of tolerance."
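The staged-rollout step is commonly implemented as a percentage-based feature flag. A minimal sketch, assuming a hash-based bucketing scheme (the hash function here is illustrative; a real system would use a feature-flag service):

```typescript
// Sketch: a deterministic percentage rollout flag. Hashing the user ID keeps
// each user's assignment stable as the rollout grows from an internal beta
// (say 1%) toward 100%, so nobody flips in and out of the agent experience.

function hashToPercent(userId: string): number {
  let h = 0;
  for (const c of userId) h = (h * 31 + c.charCodeAt(0)) >>> 0; // unsigned 32-bit
  return h % 100; // bucket in 0..99
}

function agentEnabled(userId: string, rolloutPercent: number): boolean {
  return hashToPercent(userId) < rolloutPercent;
}
```

Raising `rolloutPercent` only ever adds users to the enabled group, which is what lets a team widen exposure week by week as eval results and production feedback build confidence.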
The rollout process often involves developing new metrics for success that extend beyond traditional software metrics. "We're kind of developing some of these terminologies for what the equivalent of P95 or P99 is in AI engineering, but it's a very new field," Bhagwat explains. "We now have language around P95 and P99 in terms of response and latency time, where we know you want to optimize not just the median response time but also the long-tail response time, so that a very large fraction of your users have good experiences."
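For readers less familiar with the terminology: P95 and P99 are percentiles of a latency distribution, and they can be computed from raw samples with the nearest-rank method. A minimal sketch:

```typescript
// Sketch: computing a long-tail percentile (P95/P99) from latency samples
// using the nearest-rank method. Optimizing these, not just the median,
// is what keeps the slowest users' experience acceptable.

function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank position
  return sorted[Math.max(rank - 1, 0)];
}

const latenciesMs = [120, 95, 110, 430, 105, 98, 101, 115, 250, 102];
// The median (P50) is 105 ms, but P95 is 430 ms: a few slow outliers
// dominate the tail, which is invisible if you only track the median.
```

The open question Bhagwat raises is what the analogous agreed-upon metric becomes for answer quality, where "success" is graded by evals rather than a stopwatch.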
The Cultural Shift: Embracing Discomfort
Beyond the technical and organizational challenges, AI engineering requires a significant cultural shift. Bhagwat emphasizes the importance of embracing discomfort and maintaining enthusiasm for new approaches.
"I'm 37. And I think when you get out of your 20s into your 30s and into your later 30s, and beyond that as well, there can just be a sense of, when you see new things, reacting with default skepticism rather than default enthusiasm," Bhagwat shares. "We're engineers; we're naturally skeptical people. And where that can be challenging is if we lead with our skepticism: to be good in a new field, you need to be okay with being uncomfortable and okay with being kind of bad at this new thing that you're doing."
This discomfort is a natural part of the learning process. "You're going to have the sense of taste of it, like, 'gosh, I'm not very good at this,' and you're going to be upset at yourself. But you have to stick with it and be okay with this period of uncomfortability, and not just reject it because it's new and it's weird and it's different than the thing that you've done before."
For organizations, this means creating environments that encourage experimentation and tolerate failure. "There's a sense of, 'our CEO keeps shouting about this thing,' and there are a lot of reasons why you could choose to be skeptical. But there's a lot of opportunity in being the person who is able to build a new kind of technology, to be an early adopter and a pioneer in your field or your community or your organization, and figure out how the different pieces fit together."
The Future of AI Engineering
As AI engineering continues to evolve, we can expect several key developments:
Specialized tooling: Development of frameworks and tools specifically designed for building and deploying agentic applications, like Mastra, the open source JavaScript/TypeScript framework Bhagwat co-founded.
Advanced evaluation techniques: More sophisticated methods for testing and validating AI systems, including automated evaluation generation and continuous monitoring in production.
Hybrid team structures: Further refinement of organizational approaches that effectively combine software engineering and data science perspectives.
Domain-specific patterns: Emergence of proven patterns and best practices for common AI application types.
AI-augmented development: Increasing integration of AI tools throughout the software development lifecycle, from design to deployment.

Conclusion: Embracing the New Paradigm
The transition to AI engineering represents more than just adopting new tools—it requires a fundamental shift in how we approach software development. The non-deterministic nature of AI systems demands new approaches to testing and validation, while the agentic nature of these applications requires new architectural patterns.
Organizations that successfully navigate this transition will likely adopt several key practices:
- Building cross-functional Tiger Teams that combine software engineering and data science expertise
- Developing domain-specific evaluations that go beyond generic benchmarks
- Implementing staged rollout processes that balance innovation with risk management
- Creating cultures that embrace discomfort and encourage experimentation
As Bhagwat notes, "For me, there are two sort of magical experiences that I've had. One was the first time I had a working program running, and I was like, 'This is so cool. I'm having the computer do this.' And the second is vibe coding in AI engineering and watching what the LLM is doing and being part of this co-creation process."
The future of AI engineering lies in finding the sweet spot between human creativity and machine intelligence—a balance that will continue to evolve as both technologies and practices mature. For practitioners willing to embrace this new paradigm, the opportunities to build transformative systems are immense.
