AWS Introduces Durable Storage Option for ElastiCache for Valkey
#Infrastructure

DevOps Reporter

AWS adds synchronous and asynchronous durability modes to ElastiCache for Valkey, letting teams persist data across AZ failures while choosing between minimal data loss or lower write latency.

AWS has announced durability support for Amazon ElastiCache for Valkey, adding a persistent storage layer to what was previously a cache-only service. The feature introduces two durability modes: synchronous replication for minimal data loss, and asynchronous replication for lower write latency with bounded recovery time.

This is a significant shift in how ElastiCache can be used. Until now, the service was designed for ephemeral data that could be rebuilt from a primary source. With durability, AWS is positioning Valkey as a viable option for persistent workloads like AI agent memory, session stores, workflow state, RAG knowledge bases, payment tokenization, and inventory management.

What Changed

ElastiCache for Valkey now offers two durability tiers:

Synchronous durability: writes are acknowledged only after the data has replicated across at least two Availability Zones. This minimizes data loss but increases write latency, because the primary node waits for cross-AZ confirmation.

Asynchronous durability: writes are acknowledged before replication completes. This preserves microsecond-level write latency but introduces a durability buffer of up to 10 seconds. The primary node tracks the age of the oldest write that replication has not yet acknowledged and publishes it as the DurabilityLag CloudWatch metric. If that age exceeds 10 seconds, new writes are temporarily rejected until the cluster catches up.
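
To make the buffer behavior concrete, here is a toy model in Python. It is only a sketch of the rule described above, not AWS's implementation: the class name AsyncDurabilityBuffer, its methods, and the use of caller-supplied timestamps are all assumptions made for illustration.

```python
from collections import deque

MAX_LAG_SECONDS = 10.0  # buffer limit stated in the article


class AsyncDurabilityBuffer:
    """Toy model of a primary that acknowledges writes before replication."""

    def __init__(self, max_lag=MAX_LAG_SECONDS):
        self.max_lag = max_lag
        self.pending = deque()  # (timestamp, key) for writes not yet replicated

    def durability_lag(self, now):
        """Age in seconds of the oldest unreplicated write, or 0.0 if none."""
        return now - self.pending[0][0] if self.pending else 0.0

    def write(self, key, now):
        """Accept the write unless the lag already exceeds the limit."""
        if self.durability_lag(now) > self.max_lag:
            return False  # rejected: the caller should back off and retry
        self.pending.append((now, key))
        return True

    def replicated_up_to(self, timestamp):
        """Drop writes the replicas have confirmed up to `timestamp`."""
        while self.pending and self.pending[0][0] <= timestamp:
            self.pending.popleft()


buf = AsyncDurabilityBuffer()
print(buf.write("a", now=0.0))    # accepted
print(buf.write("b", now=5.0))    # accepted, lag is 5 s
print(buf.write("c", now=10.5))   # rejected, lag is 10.5 s
buf.replicated_up_to(5.0)         # replicas catch up
print(buf.write("c", now=10.6))   # accepted again
```

The point of the model is that the lag is measured from the oldest pending write, so a single stalled replication stream is enough to block new writes once it falls more than 10 seconds behind.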

As Jules Lasarte and Karthik Konaparthi from AWS explain in the announcement post:

Many organizations find that Multi-AZ replication and automatic failover in ElastiCache meet their resilience requirements, but as customers increasingly adopt ElastiCache as a persistent data store, as well as a cache, data loss becomes a primary concern.

Why It Matters

The core trade-off is straightforward: you can have lower write latency or tighter durability guarantees, but not both. This matters for teams that previously had to choose between ElastiCache's performance and a separate database for persistence.

Consider an AI agent session store. Users interact with the agent in real time, and losing 10 seconds of conversation history after a failure is probably acceptable. Now consider payment tokenization. Losing even a second of transaction state creates operational headaches. The two durability modes let you pick the right trade-off per workload.

Corey Quinn at The Duckbill Group offers a useful reminder in his newsletter:

Once again I am begging you to not confuse "cache" with "primary data store." Once again, you will ignore me, as some lessons can only be learned and internalized via SLA breaches.

The warning is worth internalizing. Durability adds resilience, but it does not turn a cache into a full-featured database. There are no secondary indexes, no complex query support, and no point-in-time recovery in the traditional database sense. If your workload requires those capabilities, you still need a database.

How It Works in Practice

Here is the practical breakdown of when to use each mode:

Use synchronous durability when:

  • You cannot tolerate data loss (financial transactions, inventory counts)
  • Write throughput is moderate and the application can tolerate the added latency of cross-AZ round trips
  • The workload is read-heavy with occasional writes

Use asynchronous durability when:

  • Write latency is critical (real-time counters, session state, leaderboard updates)
  • Up to 10 seconds of data loss is acceptable during failures
  • You need the lowest possible write latency without giving up all durability

Stick with traditional ElastiCache (no durability) when:

  • Data is derived and can be rebuilt from source
  • You want the lowest cost option
  • Pure caching is the use case (CDN origin, database query cache)
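
The three lists above can be condensed into a small decision helper. This is a rough sketch only: the function name, its parameters, and the returned labels are hypothetical and are not ElastiCache API values. It also ignores cost and throughput, which a real decision would weigh.

```python
ASYNC_BUFFER_SECONDS = 10  # worst-case loss window for asynchronous mode, per the article


def choose_durability_mode(rebuildable_from_source, max_tolerable_loss_seconds):
    """Return a label for the mode that fits the workload (labels are illustrative)."""
    if rebuildable_from_source:
        return "none"          # pure cache: data can be rebuilt, so skip durability
    if max_tolerable_loss_seconds >= ASYNC_BUFFER_SECONDS:
        return "asynchronous"  # the loss window is acceptable, keep writes fast
    return "synchronous"       # loss must stay minimal, accept cross-AZ latency


print(choose_durability_mode(True, 0))     # query cache
print(choose_durability_mode(False, 30))   # agent session store
print(choose_durability_mode(False, 0))    # payment tokenization
```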

The Valkey GLIDE client library supports automatic retry with exponential backoff, which becomes important when the asynchronous durability buffer fills and writes are temporarily rejected. Make sure your client configuration handles these rejections gracefully.
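
For clients or code paths where you want explicit control, an application-level wrapper might look like the sketch below. It deliberately does not use the GLIDE API: WriteRejected is a hypothetical stand-in for whatever error your client raises when a write is rejected, and the delay values are arbitrary starting points rather than recommendations.

```python
import random
import time


class WriteRejected(Exception):
    """Hypothetical stand-in for the error a client raises on a rejected write."""


def write_with_backoff(do_write, max_attempts=5, base_delay=0.05,
                       max_delay=2.0, sleep=time.sleep):
    """Call do_write(), retrying rejected writes with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return do_write()
        except WriteRejected:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Double the ceiling on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount up to the ceiling.
            sleep(random.uniform(0, ceiling))


# Example: a write that is rejected twice before succeeding.
attempts = {"count": 0}

def flaky_write():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise WriteRejected("durability buffer full")
    return "OK"

print(write_with_backoff(flaky_write, sleep=lambda seconds: None))
```

The jitter matters here because every writer on the cluster is rejected at the same moment; without it, they would all retry in lockstep once the buffer drains.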

Key Details

  • Availability: All AWS regions supporting ElastiCache for Valkey
  • Version requirement: Valkey 9.0
  • Engine support: Valkey only. Not available for Redis or Memcached on ElastiCache
  • Pricing: Synchronous and asynchronous durability add cost over the base cache tier. Standard ElastiCache without durability remains the cheapest option

The MemoryDB Question

On Reddit and in developer discussions, a recurring question is how this overlaps with Amazon MemoryDB, the Redis-compatible in-memory database with durable storage.

The short answer: MemoryDB provides stronger durability guarantees and is designed as a primary database. ElastiCache with durability provides a middle ground between pure caching and a fully durable database. If you need transactional semantics, complex data structures with persistence guarantees, or compliance-grade durability, MemoryDB is still the right choice. If you need a fast cache that does not lose everything on a node failure, ElastiCache with durability fills that gap.

For teams already running Valkey on ElastiCache, this announcement means you can incrementally add durability to specific use cases without migrating to a different service. Start with asynchronous durability on workloads where 10 seconds of data loss is acceptable, monitor the DurabilityLag metric, and evaluate whether synchronous durability is warranted based on actual failure scenarios.

Getting Started

  1. Verify your cluster is running Valkey 9.0
  2. Enable durability through the AWS Console, CLI, or CloudFormation
  3. Choose synchronous or asynchronous mode based on your durability and latency requirements
  4. Monitor DurabilityLag in CloudWatch for asynchronous mode
  5. Configure your client (Valkey GLIDE recommended) with retry and backoff policies
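
For step 4, one option is a CloudWatch alarm that fires well before the 10-second limit. The sketch below only builds the parameters for boto3's put_metric_alarm call. Several values are assumptions to verify against the AWS documentation: that DurabilityLag is published in the AWS/ElastiCache namespace, that it is reported in seconds, and that it is keyed by the CacheClusterId dimension. The helper name and the 5-second threshold are illustrative.

```python
def durability_lag_alarm_params(cluster_id, threshold_seconds=5.0):
    """Build keyword arguments for CloudWatch put_metric_alarm (values are assumptions)."""
    return {
        "AlarmName": f"{cluster_id}-durability-lag",
        "Namespace": "AWS/ElastiCache",      # assumed namespace for the metric
        "MetricName": "DurabilityLag",       # metric name from the announcement
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],  # assumed dimension
        "Statistic": "Maximum",              # worst lag seen in each period
        "Period": 60,                        # evaluate one-minute windows
        "EvaluationPeriods": 3,              # require three breaching windows in a row
        "Threshold": threshold_seconds,      # assumes the metric unit is seconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no data means no pending writes
    }


params = durability_lag_alarm_params("my-valkey-cluster")
print(params["AlarmName"], params["Threshold"])

# With boto3 installed and credentials configured, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```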

The feature is available now in all regions that support ElastiCache for Valkey.
