Claude Code's Terraform Disaster: How an AI Agent Accidentally Nuked 2.5 Years of Production Data
#DevOps

Claude Code's Terraform Disaster: How an AI Agent Accidentally Nuked 2.5 Years of Production Data

Chips Reporter
3 min read

A developer's attempt to migrate infrastructure using Claude Code resulted in the complete deletion of two websites' production environments, including databases and backups, highlighting critical risks when AI agents are given broad infrastructure permissions.

When Alexey Grigorev decided to consolidate his websites onto a single AWS infrastructure, he turned to Claude Code for assistance. What followed was a cautionary tale about the dangers of over-relying on AI agents for critical infrastructure management, resulting in the complete destruction of 2.5 years of data.

The Setup: Two Sites, One Infrastructure

Grigorev managed two websites: AI Shipping Labs and DataTalks.Club. His goal was to move both to AWS and share the same infrastructure to reduce costs and complexity. He was using Terraform, the popular infrastructure-as-code tool that can create or destroy entire cloud setups with a single command.

The Critical Mistake

The disaster began when Grigorev asked Claude to run a Terraform plan to set up the new Shipping Labs site. However, he forgot to upload a vital state file—a document that contains a complete description of the current infrastructure setup. Without this file, Claude created duplicate resources, setting the stage for catastrophe.

When Grigorev stopped the process midway and uploaded the state file, he assumed Claude would clean up the duplicates and then proceed with the correct setup. This assumption proved fatal.

The Cascade of Destruction

With the state file now available, Claude followed its logical course: it issued a Terraform "destroy" operation to wipe the existing setup before rebuilding it correctly. The problem? The infrastructure description included both websites. In one fell swoop, Claude deleted:

  • The entire production environment for both sites
  • A database containing 2.5 years of records
  • Database snapshots that Grigorev had been counting on as backups

The operator had to contact Amazon Business support, which managed to restore the data within about a day—a fortunate outcome that many organizations might not experience.

Lessons Learned: The Human Factor

In his post-mortem analysis, Grigorev identified several critical failures:

  1. Over-reliance on AI agents: He admitted to granting Claude broad permissions to run Terraform commands without proper oversight
  2. Missing state file: The initial omission created confusion that cascaded into disaster
  3. Assumed context: Grigorev assumed Claude would understand the implications of managing infrastructure for two separate websites
  4. Insufficient safeguards: No delete protections or scoped permissions were in place

The Broader Implications

This incident highlights a fundamental truth about AI agents in DevOps: they lack the contextual understanding that human operators take for granted. Just as you wouldn't give a junior sysadmin unrestricted access to production environments, AI agents require careful scoping and oversight.

Preventive Measures

Grigorev has since implemented several safeguards:

  • Database restore testing: Regular verification that backups can actually be restored
  • Delete protections: Applying safeguards to Terraform and AWS permissions
  • State file management: Moving Terraform state files to S3 storage instead of local machines
  • Manual review: Stopping Claude from running destructive commands and personally reviewing every plan before execution

The incident serves as a stark reminder that while AI agents can be powerful tools for infrastructure management, they require the same careful oversight, permissions scoping, and human judgment that we apply to human operators. In the world of DevOps, context is everything—and AI agents, no matter how sophisticated, still lack the nuanced understanding that comes from years of experience.

The question isn't whether AI will transform infrastructure management, but how we can harness its power while maintaining the safeguards that protect our most critical systems. As Grigorev's experience shows, the cost of getting this balance wrong can be measured in years of lost data and days of emergency recovery.

Comments

Loading comments...