TSGen: Microsoft's AI-Powered Solution to Cloud Troubleshooting Challenges

Microsoft introduces TSGen, an AI system that automatically generates troubleshooting guides from incident data, addressing the scalability and quality issues of manual documentation while enabling faster incident resolution across cloud services.

Transforming Cloud Incident Management Through Intelligent Automation

Microsoft has unveiled TSGen (Troubleshooting Guide Generator), an AI-powered system that automatically creates and maintains troubleshooting guides for cloud services, addressing the persistent challenges of manual documentation in incident management.

The Problem with Manual Troubleshooting Guides

Operating cloud services at scale presents unique challenges for incident management. When issues arise, engineers rely on Troubleshooting Guides (TSGs) to diagnose and resolve problems quickly. However, creating and maintaining TSGs by hand does not scale, and it becomes a bottleneck as services and incident volumes grow.

TSGs are often siloed across different platforms, making them difficult to locate during critical incidents. Their structure varies from one silo to the next and the content is sometimes incomplete, forcing engineers to interpret ambiguous instructions under time pressure.

Microsoft's internal study examining over 4,000 TSGs mapped to thousands of incidents revealed that while TSGs significantly reduce mitigation efforts when properly maintained, their quality varies dramatically. Engineers surveyed about TSG effectiveness consistently report issues with outdated information, missing steps, and lack of clarity. These quality gaps lead to extended incident resolution times, increased engineer fatigue, and higher operational costs.

Introducing TSGen: Automated TSG Generation @ Scale – Built by AI | Microsoft Community Hub

The AI-Powered Solution

TSGen's core technical innovation is synthesizing high-quality, structured TSGs directly from historical incident data instead of relying on manual authoring. It ingests diverse operational signals, such as past IcM incidents identified via monitor IDs or custom Kusto queries, and produces end-to-end, action-oriented troubleshooting workflows within minutes.

This shifts TSG creation from a labor-intensive, error-prone documentation task into an automated knowledge synthesis problem, enabling consistent structure and coverage across services.

A second key innovation is operational scalability with continuous relevance. TSGen is designed not only to generate new TSGs, but to keep them up-to-date as new incidents occur, addressing the chronic issue of stale or incomplete troubleshooting documentation.
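
One way to picture the "continuous relevance" idea is a check that flags a TSG for regeneration once enough new incidents have arrived for its monitor. The sketch below is illustrative only: the Incident and TsgRecord types, the needs_refresh function, and the min_new threshold are hypothetical names chosen for this example, not part of TSGen.

```python
from dataclasses import dataclass
from datetime import datetime


# Hypothetical record types; TSGen's real data model is not described in the article.
@dataclass
class Incident:
    incident_id: str
    monitor_id: str
    created_at: datetime


@dataclass
class TsgRecord:
    monitor_id: str
    generated_at: datetime


def needs_refresh(tsg: TsgRecord, incidents: list[Incident], min_new: int = 3) -> bool:
    """Return True when at least `min_new` incidents for the TSG's monitor
    were created after the TSG was last generated."""
    newer = [
        inc for inc in incidents
        if inc.monitor_id == tsg.monitor_id and inc.created_at > tsg.generated_at
    ]
    return len(newer) >= min_new
```

A scheduler could run a check like this periodically and queue any flagged TSG for regeneration; the threshold of three is an arbitrary placeholder.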

The system has already demonstrated practical effectiveness in pilot deployments, with dozens of generated TSGs accepted and published for real on-call usage, showing that AI-generated artifacts can meet production engineering standards rather than serving as drafts or suggestions.

Finally, TSGen explicitly targets dual consumption by humans and AI agents, generating structured outputs that are useful both for on-call engineers and for automated agents involved in incident diagnosis. This positions TSGs as a shared, machine-readable knowledge layer rather than static documents, reducing "tribal knowledge" and enabling faster, more reliable incident response at scale across Microsoft services.
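
To make the "dual consumption" idea concrete, here is a minimal sketch of a TSG held as structured data and rendered two ways: as JSON for an automated agent and as plain text for an on-call engineer. The Tsg and TsgStep classes and their fields are assumptions made for illustration; the article does not describe TSGen's actual output schema.

```python
import json
from dataclasses import asdict, dataclass, field


# Hypothetical schema, not TSGen's real format.
@dataclass
class TsgStep:
    title: str
    action: str          # what the responder should do
    expected: str = ""   # what a healthy result looks like, if known


@dataclass
class Tsg:
    title: str
    symptoms: list[str]
    steps: list[TsgStep] = field(default_factory=list)

    def to_json(self) -> str:
        """Machine-readable form for automated agents."""
        return json.dumps(asdict(self), indent=2)

    def to_text(self) -> str:
        """Human-readable form for on-call engineers."""
        lines = [self.title, "", "Symptoms:"]
        lines += [f"  - {s}" for s in self.symptoms]
        lines += ["", "Steps:"]
        for n, step in enumerate(self.steps, start=1):
            lines.append(f"  {n}. {step.title}: {step.action}")
            if step.expected:
                lines.append(f"     Expected: {step.expected}")
        return "\n".join(lines)
```

Keeping a single structured source and deriving both views from it means the human and agent versions cannot drift apart.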

TSGen's Five-Step Automated Workflow

TSGen addresses the manual TSG challenge through a sophisticated five-step automated workflow that transforms incident data into executable troubleshooting guides:

1. Collection: Gathers incident data from multiple sources including diagnostic logs, historical tickets, and troubleshooting documentation. This comprehensive data aggregation creates the foundation for intelligent TSG generation.

2. Filtering: Removes noise and irrelevant information from the collected data. Machine learning algorithms identify which incident attributes are most relevant for troubleshooting, eliminating false signals that could lead to incorrect guidance.

3. Core Incident Selection: Identifies representative incidents that exemplify common problem patterns. Rather than processing every incident individually, TSGen selects the most informative examples that capture the essential troubleshooting logic.

4. Data Distillation: Extracts key troubleshooting patterns and actionable steps from the selected incidents. This process analyzes successful resolution paths to identify the critical diagnostic checks and mitigation actions.

5. TSG Generation: Synthesizes the distilled information into structured, actionable troubleshooting guides. The output is a well-formatted TSG that engineers can follow systematically during incident response.
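
The five steps above can be sketched as a small pipeline. This is a toy illustration, not TSGen's implementation: the article says the real system uses machine learning for filtering and synthesis, whereas the code below substitutes simple heuristics (resolved-only filtering, step counts, frequency thresholds). All type and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:  # hypothetical shape of an incident record
    incident_id: str
    title: str
    resolved: bool
    steps: list[str]  # mitigation steps recorded during the incident


def collect(*sources: list[Incident]) -> list[Incident]:
    """Step 1: merge incidents from several sources, dropping duplicate IDs."""
    seen: dict[str, Incident] = {}
    for source in sources:
        for inc in source:
            seen.setdefault(inc.incident_id, inc)
    return list(seen.values())


def filter_noise(incidents: list[Incident]) -> list[Incident]:
    """Step 2: keep only resolved incidents that recorded at least one step."""
    return [inc for inc in incidents if inc.resolved and inc.steps]


def select_core(incidents: list[Incident], k: int = 3) -> list[Incident]:
    """Step 3: keep the k incidents with the most recorded steps
    (a crude stand-in for 'most informative')."""
    return sorted(incidents, key=lambda inc: len(inc.steps), reverse=True)[:k]


def distill(incidents: list[Incident], min_support: int = 2) -> list[str]:
    """Step 4: keep steps seen in at least `min_support` incidents,
    most frequent first."""
    counts = Counter(
        step for inc in incidents for step in dict.fromkeys(inc.steps)
    )
    return [step for step, n in counts.most_common() if n >= min_support]


def generate(title: str, steps: list[str]) -> str:
    """Step 5: render the distilled steps as a numbered guide."""
    body = "\n".join(f"{n}. {s}" for n, s in enumerate(steps, start=1))
    return f"TSG: {title}\n{body}"


def run_pipeline(title: str, *sources: list[Incident]) -> str:
    core = select_core(filter_noise(collect(*sources)))
    return generate(title, distill(core))
```

Each stage takes and returns plain lists, so any one heuristic could be swapped for a learned model without touching the others.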

Real-World Impact

The shift from manual TSG creation to automated TSG maintenance delivers measurable benefits for incident management operations. Teams using automated TSG maintenance report significant reductions in time-to-mitigation for common incident types. Because generated TSGs share consistent formatting and reliable information, engineers spend less time searching for relevant documentation and interpreting ambiguous instructions, and can focus on complex problem-solving instead.

Industry-Wide Implications for Cloud Operations

TSGen represents a broader trend toward intelligent automation in cloud operations. The challenges of maintaining high service availability while managing complex distributed systems affect organizations across industries. As cloud infrastructure grows in scale and complexity, so does the volume of potential incidents, and traditional manual approaches cannot keep pace.

Automated TSG generation offers a scalable solution that improves with the volume of data it processes. Each incident handled by the system contributes to its collection of incident knowledge, creating a positive feedback loop for ever-improving TSGs. This scalability benefit is particularly valuable for organizations operating multiple services or supporting global customer bases.

The technology also democratizes incident management expertise. In traditional models, effective troubleshooting requires deep institutional knowledge that takes years to develop. Automated systems capture and codify this expertise, making it accessible to engineers at all experience levels. This knowledge transfer capability reduces dependency on veteran engineers and accelerates onboarding for new team members.

Key Benefits of Automated TSG Generation

Automated TSG generation delivers multiple strategic advantages for organizations managing cloud infrastructure:

Faster incident resolution reduces service disruptions and improves customer experience
Improved TSG quality through continuous learning ensures troubleshooting guidance remains accurate and comprehensive
Reduced operational costs result from decreased manual documentation maintenance and shorter incident durations
Enhanced engineer productivity allows technical teams to focus on innovation rather than repetitive troubleshooting tasks
Knowledge preservation captures institutional expertise in executable form, protecting organizations from knowledge loss when engineers transition
Scalability enables consistent incident management across growing infrastructure without proportional headcount increases
Data-driven insights from automated systems reveal patterns in incident types and resolution effectiveness, informing preventive measures

Building with AI: Lessons Learned

This iteration of TSGen was developed in VS Code using Copilot CLI, with Claude models (including Opus 4.6) providing implementation support. The team used AI at every level of development, and AI wrote the majority of the code, which shortened iteration cycles.

Several practical learnings helped the team get better outcomes and avoid rework:

Create a solid plan up front for each major change. The team used "Plan" mode in VS Code to have Claude models help define what they wanted to build in a form the AI could act on. For example, when converting the codebase from Node.js to Python, they wrote a new, dedicated plan.

Be detailed in the initial description and write down explicit requirements as bullet points. The initial prompt to generate the plan was quite long and focused on getting information into the AI rather than giving it an actual handmade plan. Including specifics like folder structure and functionality requirements proved crucial.

If you already know how you want something to work, state it directly. Specific instructions beat vague intent. Models can produce unexpected solutions, so providing code pointers and specific details for functions helps ensure alignment with goals.

During plan creation, answer follow-up questions with as much context as possible. When the model asks for follow-up, providing extensive information helps ensure the result aligns with expectations.

Read the plan critically and "negotiate" it as you go. Treat the AI like a junior developer and make expectations explicit. Reading the plan in full before work starts helps catch miscommunications early.

If a model isn't producing good results, switch models and try again. Bringing in a new model can have the same effect as bringing in a different engineer with fresh eyes, especially given the speed of release for new models.

When the model is missing context, give hints about where to look. Providing file, folder, or component references helps ground the plan in relevant examples.

The Future of Intelligent Incident Management

The evolution of TSG automation points toward increasingly autonomous incident management systems. Current systems like TSGen focus on automating TSG generation and execution for known incident patterns. Future developments will likely expand into autonomous root cause analysis and predictive incident prevention.

Advanced AI agents could execute complex diagnostic workflows without human intervention, escalating only when novel situations arise that require human judgment. Natural language processing capabilities will enable engineers to interact with troubleshooting systems conversationally, asking questions and receiving context-aware guidance.

The integration of reinforcement learning could allow systems to optimize troubleshooting strategies in real-time based on success rates. These systems might automatically adjust their approaches when initial steps prove ineffective, exploring alternative resolution paths intelligently.
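
As a rough illustration of optimizing by success rate, the sketch below uses an epsilon-greedy selector, one of the simplest reinforcement-learning-style strategies: it usually picks the resolution path with the best observed success rate and occasionally explores another. This is a speculative example for a capability the article describes as a future possibility; the PathSelector class and the path names are hypothetical.

```python
import random


class PathSelector:
    """Epsilon-greedy choice among candidate resolution paths (hypothetical)."""

    def __init__(self, paths, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.attempts = {p: 0 for p in paths}
        self.successes = {p: 0 for p in paths}

    def success_rate(self, path):
        n = self.attempts[path]
        return self.successes[path] / n if n else 0.0

    def choose(self):
        # Explore a random path with probability epsilon...
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.attempts))
        # ...otherwise exploit the path with the best observed success rate.
        return max(self.attempts, key=self.success_rate)

    def record(self, path, succeeded):
        """Update statistics after a path was tried on a real incident."""
        self.attempts[path] += 1
        self.successes[path] += int(succeeded)
```

For instance, after recording a failed "restart" and a successful "rollback", a selector with epsilon set to 0 would choose "rollback" on the next incident.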

Another promising direction involves cross-system learning, where troubleshooting knowledge from one service or organization informs incident management in others. This collective intelligence approach could accelerate the development of effective troubleshooting strategies industry-wide.

The ultimate vision is incident management systems that continuously improve, require minimal human oversight, and prevent problems before they impact customers.

For more information about Microsoft's AI initiatives in cloud operations, visit the Microsoft Foundry Blog.