Google Cloud SREs demonstrate how the AI-powered Gemini CLI, built on Gemini 3, transforms incident response by automating classification, mitigation, root cause analysis, and postmortem generation while maintaining human oversight for safety.
Google Cloud Site Reliability Engineers (SREs) have detailed their internal use of the AI-powered Gemini CLI to dramatically improve incident response times and reliability in critical infrastructure operations. The approach, outlined by Riccardo Carlesso and Ramón Medrano Llamas, integrates intelligent reasoning directly into terminal-based operational tools, reducing Mean Time to Mitigation (MTTM) while keeping SREs in control.
The MTTR vs MTTM Distinction
The SREs emphasize their focus on MTTM rather than traditional Mean Time to Repair (MTTR). While MTTR measures the full repair cycle, MTTM specifically targets how quickly pain can be stopped. This distinction matters because Google's SRE teams operate under extreme pressure with 5-minute Service Level Objective (SLO) acknowledgment requirements and immediate mitigation expectations following pages.
AI-Powered Incident Lifecycle
A typical incident follows four phases: paging, mitigation, root cause analysis, and postmortem. The Gemini CLI assists at every stage:
Paging and Initial Investigation The CLI excels at symptom classification and playbook selection. When an incident triggers, the AI can analyze symptoms and dynamically generate mitigation playbooks—instruction sets that guide agents through production mutations safely. These playbooks include not just commands but also verification steps and rollback procedures.
Currently, human-in-the-loop validation remains essential. As Carlesso and Medrano Llamas note, "A human-in-the-loop is currently required to verify the proposed mitigations. As agent capabilities mature and agentic safety systems advance, this dependency is expected to decrease." The CLI enforces layered safety controls, positioning the AI as a copilot rather than an autonomous operator.
Root Cause Analysis Once initial mitigation stabilizes infrastructure, the focus shifts to identifying the underlying cause. The Gemini CLI can analyze application logic and direct operators to relevant source code sections, accelerating the investigation process.
Automated Postmortem Generation The final phase traditionally involves tedious compilation of timelines, logs, and actions. The Gemini CLI simplifies this through custom commands that scrape conversation history, metrics, and logs to populate CSV timelines, generate Markdown documents, and suggest preventive action items.
Universal Pattern, Custom Implementation
While the example uses some Google-internal tools, the authors emphasize the pattern's universality. Organizations can build similar workflows using the Gemini CLI, MCP servers to connect Gemini to tools like Grafana, Prometheus, and PagerDuty, and custom slash commands that define reusable prompts.
The Virtuous Loop of Self-Improvement
Perhaps most compelling is the feedback mechanism. Postmortems generated by the CLI become training data for future incidents. "The output of today's investigation becomes the input for tomorrow's solution," the authors explain, creating a continuous improvement cycle.
Wen-Tsung Chang, senior infrastructure engineer at Houzz, reinforces the importance of maintaining human accountability: "No matter what stage we are at right now, we should always stay accountable and never give up on critical thinking."
This approach represents a significant evolution in SRE practices, where AI augments rather than replaces human expertise, maintaining safety while dramatically accelerating response times in production environments.

Comments
Please log in or register to join the discussion