AWS Resilience Hub’s next generation adds an AI‑powered failure‑mode engine, modular policy composition, automated dependency discovery and organization‑wide reporting, giving SRE teams a unified way to define, test and prove resilience across hundreds of applications.

AWS Resilience Hub 2.0 – A New Chapter for SREs

Service update

AWS announced that Resilience Hub 2.0 is now generally available in all commercial Regions where the service previously existed. The upgrade introduces five core capabilities:

Modular resilience policies – rather than a single preset, teams assemble policies from reusable requirements such as SLO, multi‑AZ/Region DR, and data‑recovery objectives.
Business‑level application model – a system represents a business application, user journeys capture critical end‑user paths, and services map to deployable units (CloudFormation stacks, Terraform state, EKS namespaces, etc.).
Generative‑AI failure‑mode assessment – an LLM‑backed engine evaluates each service against the selected policy, the AWS Well‑Architected Framework and the Resilience Analysis Framework, then returns concrete failure‑mode findings and remediation steps.
Automatic dependency discovery – DNS query‑log analysis uncovers hidden AWS, internal and third‑party endpoints, surfacing cross‑region calls or undocumented SaaS dependencies.
Organization‑wide reporting – through AWS Organizations integration, a delegated admin can view compliance, progress and risk scores for every account from a single console.
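
To make the modular-policy idea and the system / user journey / service hierarchy more concrete, here is a minimal Python sketch. The class names (Requirement, Policy, Service, System), their fields and the source strings are hypothetical illustrations of the concepts, not the Resilience Hub API or its actual data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Requirement:
    """One reusable building block of a policy (hypothetical shape)."""
    name: str
    target: str


@dataclass
class Policy:
    name: str
    requirements: list[Requirement] = field(default_factory=list)

    def compose(self, name: str, *extra: Requirement) -> "Policy":
        """Return a new policy that reuses these requirements and adds more."""
        return Policy(name, [*self.requirements, *extra])


@dataclass
class Service:
    name: str
    source: str  # e.g. a CloudFormation stack, Terraform state or EKS namespace


@dataclass
class System:
    name: str
    policy: Policy
    user_journeys: list[str] = field(default_factory=list)
    services: list[Service] = field(default_factory=list)


# Reusable requirements that several teams could share.
slo = Requirement("availability-slo", "99.95%")
multi_az = Requirement("multi-az", "survive loss of one AZ")
rpo = Requirement("data-recovery", "RPO 5 minutes")

baseline = Policy("baseline", [slo, multi_az])
payments_policy = baseline.compose("payments", rpo)  # baseline plus one extra

payments = System(
    name="payments",
    policy=payments_policy,
    user_journeys=["checkout"],
    services=[Service("ledger", "cfn:ledger-stack"), Service("api", "eks:payments")],
)

print([r.name for r in payments.policy.requirements])
```

Composing a new policy returns a fresh object and leaves the baseline untouched, which is one simple way to let teams extend a shared policy without editing it.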

Pricing moves to a service‑based model: each month includes two free failure‑mode assessments per service, with optional paid automated dependency scans. The full pricing matrix is available on the AWS Resilience Hub pricing page.

Use cases

1. Enterprise‑scale SRE governance

A multinational bank runs 300+ microservices across 15 accounts. Before Hub 2.0, each team defined its own DR targets, which made audit trails noisy. By publishing a global multi‑Region DR policy (99.95% SLO, 15‑minute RTO, 5‑minute RPO) and attaching it to every system, the compliance team now has a single source of truth. The organization‑wide dashboard shows which services meet the policy, which are overdue, and why.
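
As a quick sanity check on the targets in this example, the sketch below converts the 99.95% SLO into a monthly downtime budget and compares it with the 15-minute RTO. The 30-day month and the function name are assumptions made for illustration.

```python
def downtime_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a window of `days` days at the given SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


budget = downtime_budget_minutes(99.95)
rto_minutes = 15

print(f"Monthly budget: {budget:.1f} min")  # Monthly budget: 21.6 min
print(f"Full-RTO outages that fit in the budget: {int(budget // rto_minutes)}")  # 1
```

Under these assumptions, a single outage that takes the full 15 minutes to recover uses roughly 70% of the month's budget, which is why the SLO and RTO in a policy are best chosen together.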

2. Rapid onboarding of new workloads

A fintech startup spins up a new trading engine every quarter. Using the Create system → Create service wizard, the team imports a Terraform state file, selects the pre‑built “low‑latency trading” policy, and runs an AI assessment. Within minutes the team receives a topology map, a list of hidden third‑party market‑data endpoints, and a set of failure‑mode recommendations (e.g., enable cross‑Region replication with global tables for the order‑book DynamoDB table, since DynamoDB already replicates across Availability Zones within a Region). The whole process replaces a weeks‑long manual review.

3. Incident post‑mortem automation

After a regional outage, the SRE team runs a dependency discovery scan on the affected services. The scan reveals an unexpected call to a legacy on‑prem API that timed out during the incident. The AI assessment automatically tags this as a high‑severity failure mode and suggests adding a circuit‑breaker pattern. The findings are exported to the incident‑management tool via the Resilience Hub API, closing the loop between detection and remediation.
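
For readers unfamiliar with the circuit-breaker pattern mentioned above, here is a minimal Python sketch. It is a generic illustration of the pattern, not code produced by Resilience Hub; the class name, thresholds and the legacy_api stand-in are hypothetical.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def legacy_api():
    """Stand-in for the on-prem API that timed out during the incident."""
    raise TimeoutError("on-prem API timed out")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
for _ in range(3):
    try:
        breaker.call(legacy_api)
    except TimeoutError:
        print("call failed")
    except CircuitOpenError:
        print("circuit open, failing fast")
```

After two consecutive failures the third call is rejected immediately instead of waiting on the timeout, so the caller can fall back or degrade gracefully while the dependency recovers.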

Trade‑offs and considerations

Aspect	Benefit	Potential drawback
Modular policies	Fine‑grained control; reuse across teams	Requires governance to avoid policy sprawl
AI‑driven assessments	Faster insight, natural‑language explanations	Recommendations depend on the quality of the underlying model; may need human validation
Automatic dependency discovery	Surfaces hidden calls without code changes	Relies on VPC DNS query logs; services that bypass DNS (e.g., static IP calls) may be missed
Organization‑wide reporting	Single pane of glass for compliance	Delegated admin must manage cross‑account IAM roles and service‑linked roles correctly
Service‑based pricing	Pay only for assessments you run	High‑frequency assessment workloads could increase cost; budgeting needed
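
The DNS-based discovery row in the table can be illustrated with a small sketch that buckets queried hostnames into AWS, internal and third-party endpoints. It assumes JSON-lines records with a query_name field, resembling Route 53 Resolver query logs; the internal suffixes and the classification rules are hypothetical and far simpler than whatever the service actually does.

```python
import json
from collections import Counter

# Hypothetical internal domains; replace with your organization's own.
INTERNAL_SUFFIXES = (".corp.example.com", ".internal")


def classify(hostname: str) -> str:
    """Put a queried hostname into a coarse dependency category."""
    host = hostname.rstrip(".").lower()
    if host.endswith(".amazonaws.com"):
        return "aws"
    if host.endswith(INTERNAL_SUFFIXES):
        return "internal"
    return "third-party"


def summarize(log_lines):
    """Count distinct queried hostnames per category."""
    seen = set()
    for line in log_lines:
        record = json.loads(line)
        seen.add(record["query_name"].rstrip(".").lower())
    return Counter(classify(host) for host in seen)


sample = [
    '{"query_name": "dynamodb.us-east-1.amazonaws.com."}',
    '{"query_name": "billing.corp.example.com."}',
    '{"query_name": "api.marketdata.example.net."}',
    '{"query_name": "api.marketdata.example.net."}',
]
print(summarize(sample))
```

A call made to a hard-coded IP address never produces a DNS query, so it would not appear in the log lines at all. That is the blind spot noted in the table.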

When adopting Hub 2.0, teams should start with a pilot system to calibrate policy definitions and tune the AI assertions. The console lets you edit assertions (the prompts that guide the LLM), so you can improve accuracy for domain‑specific failure modes (e.g., financial‑transaction consistency). Over time, the policy library can be versioned, and older versions can be retired via the migration APIs that convert legacy assessments into the new model.

Getting started

Create an invoker IAM role with read‑only access to your resources (or enable the service‑linked role if you use AWS Organizations). See the Resilience Hub User Guide – Prerequisites.
Define a policy in the console – choose requirements that match your business needs.
Create a system and add one or more services, linking them to CloudFormation stacks, Terraform state files, or EKS namespaces.
Enable dependency discovery if you want hidden endpoints mapped.
Run a failure‑mode assessment and review the generated topology and findings.
Iterate – mark findings as resolved or irrelevant, adjust assertions, and re‑run assessments as your architecture evolves.

For a step‑by‑step walkthrough, refer to the official AWS Resilience Hub documentation. If you already use Resilience Hub, the migration APIs are described in the Migration section of the guide.

Closing thoughts

Resilience Hub 2.0 blends structured policy engineering with generative AI, giving SREs a repeatable path from intent to evidence. The service does not eliminate the need for human judgment, but it reduces the friction of discovering hidden dependencies and articulating failure modes. By centralizing policy definition and reporting at the organization level, teams can show with evidence that their applications meet the resilience targets the business requires.

Give the new experience a try in the Resilience Hub console and share feedback on AWS re:Post or through your usual support channels.

#AWS #SRE #Generative AI #resilience #Cloud
