While Terraform Stacks promise modular infrastructure with contained failure domains, introducing runtime dependencies between components exposes a fundamental gap in the model. The distinction between infrastructure graph dependencies and distributed system health dependencies creates a blast radius that Stacks cannot fully contain, forcing architects to re-evaluate their component boundaries.
I’ve found that Terraform Stacks works really well as long as the dependency chain is purely infrastructure. One component produces outputs, another consumes them, and everything converges deterministically. The problems start as soon as I introduce a runtime dependency — where a downstream component depends on an upstream application actually being alive and serving traffic. At that point, the model shifts from "infrastructure graph" to "distributed system health," a distinction Terraform Stacks has no way to express.
This matters because one of the main reasons I’m using Terraform Stacks in the first place is to contain blast radius. I want failures to be scoped, attributable, and localized to the components where they actually occur. In my case, I’m working with a stack that looks like this: global → regional-stamp → api. The intent is straightforward. The global component lays down shared foundations. The regional-stamp component provisions the per-region runtime, including an Azure Container App that hosts my service. The api component depends on regional-stamp, but not because it needs a subnet ID or a resource group name—it depends on the application itself being healthy and reachable.
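As a rough sketch, the component wiring might look something like this in the deployment's components.tfstack.hcl. The module paths, input names, and provider wiring here are placeholders rather than my actual configuration:

```hcl
# components.tfstack.hcl (simplified). provider.azurerm.main is assumed to be
# declared in the stack's provider configuration.
component "global" {
  source = "./modules/global"

  inputs = {
    environment = var.environment
  }

  providers = {
    azurerm = provider.azurerm.main
  }
}

component "regional_stamp" {
  source = "./modules/regional-stamp"

  inputs = {
    environment = var.environment
    region      = var.region
    # Classic infrastructure dependency: an attribute produced by global.
    network_id  = component.global.network_id
  }

  providers = {
    azurerm = provider.azurerm.main
  }
}

component "api" {
  source = "./modules/api"

  inputs = {
    # This wiring only proves the Container App resource exists; it says
    # nothing about whether the service behind it is healthy.
    container_app_url = component.regional_stamp.container_app_url
  }

  providers = {
    azurerm = provider.azurerm.main
  }
}
```

Everything above is a pure infrastructure graph: each edge is an output feeding an input, and Terraform can resolve it deterministically.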
That distinction turns out to be the entire point of the problem.
The Expectation
When I model this as separate stack components, I’m making an explicit architectural statement: these things are independent. The global foundation should be stable. The regional stamp should be independently deployable. The API layer should be able to fail without bringing down the entire stack.
The Terraform Stacks model reinforces this expectation. Components have their own state, their own lifecycle, and their own failure domains. If the API component fails to apply, the regional stamp should remain untouched. If the regional stamp has an issue, the global foundation should be unaffected.
This is the promise of contained blast radius: failures are local, recovery is scoped, and the system remains resilient because its components are truly independent.
The Reality of Runtime Dependencies
The problem emerges when the dependency isn’t just about resource attributes but about service health. In my architecture, the api component needs to know that the Azure Container App in regional-stamp is actually running and serving traffic before it can proceed with its own configuration.
This isn’t a Terraform dependency in the traditional sense. It’s not about passing a container_app_url output variable. It’s about verifying that a service is operational, which requires health checks, readiness probes, or API calls. Terraform Stacks, by design, doesn’t handle this. It manages infrastructure state, not service state.
When I encode this dependency in my stack configuration, I’m creating a coupling that the Stacks model doesn’t fully account for. The api component’s success becomes contingent on the regional-stamp component’s runtime health, not just its infrastructure output.
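To make the coupling concrete, here is roughly what that encoding looks like if I try to express it inside the api component itself, using the hashicorp/http data source and a custom condition. The variable name and health URL are hypothetical:

```hcl
# Inside the api component's module. The health URL is a hypothetical input
# wired from the regional-stamp component's outputs.
variable "regional_stamp_health_url" {
  type = string
}

data "http" "regional_stamp_health" {
  url = var.regional_stamp_health_url

  lifecycle {
    postcondition {
      condition     = self.status_code == 200
      error_message = "regional-stamp Container App did not respond with HTTP 200."
    }
  }
}
```

If the Container App isn't serving yet, this postcondition fails the api component's run even though nothing in api's own configuration is wrong.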
The Blast Radius Expansion
Here’s where the blast radius myth breaks down. If the regional stamp’s Container App fails to start — perhaps due to a misconfigured environment variable or a dependency on an external service — the api component will also fail. But the failure isn’t contained within regional-stamp. It propagates to api because of the runtime dependency.
Worse, the failure mode is opaque. Terraform will report that the api component failed to apply, but the root cause is in regional-stamp. The blast radius has expanded beyond the component where the actual error occurred.
This violates the core principle of contained blast radius. Failures should be scoped to the component that caused them. Instead, we get cascading failures that are difficult to diagnose because the dependency is implicit in the runtime relationship, not explicit in the infrastructure graph.
The Architectural Implications
This isn’t just a Terraform Stacks limitation — it’s a fundamental challenge in modeling distributed systems with infrastructure-as-code. The moment you introduce runtime dependencies, you’re no longer dealing with a pure infrastructure graph. You’re dealing with a distributed system where component health is interdependent.
Terraform Stacks excels at managing the former. It can handle complex dependencies between resources, ensure proper ordering, and maintain state consistency. But it cannot manage the latter. It cannot verify that a service is healthy, only that its infrastructure exists.
This means that architects using Terraform Stacks need to be explicit about what kind of dependencies they’re modeling. If a component depends on another component’s runtime health, that dependency belongs outside the Stacks model — perhaps in a separate orchestration layer or a post-deployment validation step.
Practical Workarounds
In my case, I’ve had to restructure my approach. Instead of having the api component depend directly on the regional-stamp component’s runtime health, I’ve moved that validation into a separate process. The regional-stamp component now includes a health check endpoint that the api component can poll independently.
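Sketched with the azurerm provider, the regional-stamp side looks roughly like this; the /healthz path, port, and attribute choices are assumptions about my service rather than anything Stacks prescribes:

```hcl
# Inside the regional-stamp component's module (abridged).
resource "azurerm_container_app" "service" {
  name                         = "svc-${var.region}"
  resource_group_name          = var.resource_group_name
  container_app_environment_id = var.container_app_environment_id
  revision_mode                = "Single"

  ingress {
    external_enabled = true
    target_port      = 8080

    traffic_weight {
      latest_revision = true
      percentage      = 100
    }
  }

  template {
    container {
      name   = "service"
      image  = var.image
      cpu    = 0.25
      memory = "0.5Gi"

      # The probe Azure uses to decide whether the app is ready; path and
      # port are assumptions about the service.
      readiness_probe {
        transport = "HTTP"
        port      = 8080
        path      = "/healthz"
      }
    }
  }
}

# Exposed as plain data, not as an apply-time dependency: downstream steps can
# poll this URL without their infrastructure hinging on the service being up.
output "health_endpoint" {
  value = "https://${azurerm_container_app.service.ingress[0].fqdn}/healthz"
}
```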
This separation preserves blast radius containment. If the regional-stamp component fails, the api component’s infrastructure remains intact. The api component may not be fully functional, but its deployment doesn’t fail because of a runtime dependency.
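On the api side, one way to surface the runtime dependency without letting it block deployment is a Terraform check block (available since Terraform 1.5) with a scoped http data source. Check assertions show up as warnings in the run output rather than errors, so the api component still applies cleanly. The endpoint variable name is again hypothetical:

```hcl
# Inside the api component's module. Check blocks report failed assertions as
# warnings; they do not fail the plan or apply.
variable "regional_stamp_health_endpoint" {
  type = string
}

check "regional_stamp_is_serving" {
  data "http" "health" {
    url = var.regional_stamp_health_endpoint
  }

  assert {
    condition     = data.http.health.status_code == 200
    error_message = "regional-stamp health endpoint did not return HTTP 200; api infrastructure applied, but the runtime dependency is not yet satisfied."
  }
}
```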
I’ve also started using Terraform Stacks’ depends_on attribute more judiciously. It’s tempting to use it to encode runtime dependencies, but that’s a misuse of the feature: depends_on should express infrastructure ordering, where one component’s resources genuinely cannot be created until another component’s resources exist, not service health.
The Broader Pattern
This issue isn’t unique to Terraform Stacks. Any infrastructure-as-code tool that models dependencies as a graph will face the same challenge when runtime dependencies enter the picture. The graph model is excellent for infrastructure, but it’s insufficient for distributed systems.
The solution isn’t to abandon the graph model but to recognize its limitations. Infrastructure-as-code should focus on what it does best: managing resources and their relationships. Runtime dependencies and service health belong in a different layer — perhaps a service mesh, a configuration management tool, or a custom orchestration system.
Recommendations for Architects
If you’re using Terraform Stacks or a similar tool, consider these guidelines:
Be explicit about dependency types: Distinguish between infrastructure dependencies (resource attributes) and runtime dependencies (service health).
Contain runtime dependencies: Keep runtime dependencies outside the infrastructure graph. Use separate validation steps, health checks, or orchestration tools.
Design for failure isolation: Ensure that a failure in one component doesn’t cascade to others through implicit runtime dependencies.
Monitor component health independently: Use tools like Azure Monitor, Prometheus, or custom health endpoints to verify service availability post-deployment.
Document the boundaries: Make it clear which dependencies are managed by Terraform Stacks and which are handled elsewhere.
Conclusion
Terraform Stacks is a powerful tool for managing complex infrastructure with contained blast radius, but its model breaks down when runtime dependencies enter the picture. The distinction between infrastructure graph dependencies and distributed system health dependencies is critical, and architects must design their component boundaries accordingly.
By recognizing this limitation and separating runtime dependencies from infrastructure dependencies, we can preserve the benefits of contained blast radius while building resilient distributed systems. The key is to use the right tool for the right job: Terraform Stacks for infrastructure, and dedicated orchestration tools for runtime health.
For more information on Terraform Stacks and best practices, visit the official Terraform Stacks documentation. For Azure Container Apps health checks, refer to the Azure Container Apps health probes documentation.
