Figma has developed FigCache, a custom Redis proxy service that replaced its fragmented caching infrastructure and now delivers 99.9999% uptime. The in-house solution addresses connection management challenges, improves observability, and provides a unified abstraction layer over Redis clusters.
Figma has published a detailed account of how it built an in-house Redis proxy service called FigCache, replacing a fragmented caching stack that had become a liability for site availability. The system, described in a post by software engineer Kevin Lin, has been in production since the second half of 2025 and has delivered what the company describes as six nines of uptime across its caching layer.
"Scalability and reliability gaps in our Redis platform were a growing threat to Figma's site availability," Lin wrote in the announcement. The move represents the second major storage rearchitecture Figma has undertaken in quick succession. In 2024, the company completed a nine-month project to horizontally shard its Postgres stack, building a bespoke proxy service called DBProxy to intercept, parse and route SQL queries across physical shards. FigCache extends that pattern into the ephemeral data layer, applying the same philosophy: build a proxy tier that absorbs operational complexity so that application code does not have to.

The Problem: Fragmented Caching Infrastructure
The symptoms that motivated the project were familiar to anyone who has run Redis at scale. Connection volumes were creeping towards hard limits. Rapid scale-ups of client services produced thundering herds of new connection requests, saturating Redis I/O and degrading availability. A sprawl of independent client libraries had grown up over time, each with its own observability behavior, making it difficult to diagnose incidents quickly.
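Connection herds of this kind are commonly blunted by bounding how many new dials may be in flight at once and adding jitter so a fleet-wide scale-up does not hammer Redis in lockstep. A minimal sketch of the idea in Go — illustrative only, not Figma's implementation; `dialGate` and the fake dial string are invented for this example:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// dialGate bounds how many new upstream connections may be opened
// concurrently, and adds jitter so a rapid scale-up of clients does
// not produce a synchronized burst of dials against Redis.
type dialGate struct {
	slots chan struct{}
}

func newDialGate(maxConcurrentDials int) *dialGate {
	return &dialGate{slots: make(chan struct{}, maxConcurrentDials)}
}

func (g *dialGate) Dial(addr string) string {
	g.slots <- struct{}{} // acquire a slot; blocks when the gate is full
	defer func() { <-g.slots }()
	// small random delay de-synchronizes dials across a fleet
	time.Sleep(time.Duration(rand.Intn(5)) * time.Millisecond)
	return "connected:" + addr // stand-in for a real net.Dial
}

func main() {
	g := newDialGate(4)
	fmt.Println(g.Dial("redis-1:6379"))
}
```

A client-side pooling layer like the one Figma initially built applies the same throttling per service; a proxy tier centralizes it for every client at once.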
Lin writes that the team initially built service-specific workarounds, including a custom client-side connection pooling layer, but concluded these were "isolating Redis outages from top-level site availability" rather than solving the underlying structural problems.
The decision to build rather than adopt an existing open source proxy came down to the limits of what was available. Lin notes that existing solutions shipped with "rudimentary RPC servers that were not capable of extracting full, annotated arguments from arbitrary inbound Redis commands." Without that semantic awareness, Figma could not implement the runtime guardrails it needed, nor define custom commands that the proxy itself could intercept and handle.
The company also needed to support a fragmented existing client base: some services were Redis Cluster-aware, some used TLS, some did neither. A proprietary layer allowed the team to build shims that handled all these variants transparently, including a Redis Cluster emulation mode that presented the proxy to cluster-aware clients as a fake cluster.
"Extending existing open source Redis proxies with custom business logic proved heavyweight and logistically brittle, requiring maintenance of a source code fork that would be difficult to keep in sync with upstream," Lin explained.
Architecture: Frontend/Backend Separation
FigCache itself is a stateless service built on ResPC, a Go library the team wrote to provide an RPC framework over the Redis Serialization Protocol (RESP). The proxy sits between client applications and a fleet of Redis clusters on AWS ElastiCache. Its architecture separates a frontend layer, which handles connection management and protocol-aware command parsing, from a backend layer that manages connection multiplexing and command execution against upstream clusters.
This frontend/backend separation is what makes the system extensible: new behaviors can be introduced at either layer without disrupting the other. One of the more unusual design choices is how FigCache's backend is configured. Rather than using static configuration files, the engine tree that governs how commands are routed and processed is expressed as a Starlark program, evaluated at runtime in a virtual machine, which then renders a Protobuf-structured configuration consumed by the server.
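The essence of a rendered-configuration design is that behavior lives in data the server consumes, not in compiled code. A toy illustration in Go of one such behavior, key-prefix-based rejection rules — the `RejectionRule` and `Router` shapes are invented for this sketch and only stand in for FigCache's Protobuf schema, which is actually produced by evaluating a Starlark program:

```go
package main

import (
	"fmt"
	"strings"
)

// RejectionRule mirrors the idea of a rendered, structured config
// entry: reject certain commands for keys under a given prefix.
type RejectionRule struct {
	KeyPrefix string
	Commands  map[string]bool // command names to reject for this prefix
}

// Router is "rendered" from configuration at runtime; swapping in a
// new rule set changes behavior without redeploying the binary.
type Router struct{ rules []RejectionRule }

func (r *Router) Allow(command, key string) bool {
	for _, rule := range r.rules {
		if strings.HasPrefix(key, rule.KeyPrefix) && rule.Commands[strings.ToUpper(command)] {
			return false
		}
	}
	return true
}

func main() {
	router := &Router{rules: []RejectionRule{
		{KeyPrefix: "session:", Commands: map[string]bool{"KEYS": true, "FLUSHALL": true}},
	}}
	fmt.Println(router.Allow("GET", "session:123"))  // true
	fmt.Println(router.Allow("KEYS", "session:123")) // false
}
```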
This means operators can change routing logic, key-prefix-based rejection rules, and command-type splitting purely through configuration, without redeploying server binaries.

The proxy also handles a class of problem that Redis Cluster normally surfaces to clients as an error. Redis Cluster returns CROSSSLOT errors when a pipeline or transaction spans multiple hash slots, since those operations may touch different physical shards. FigCache includes a fanout filter engine that intercepts eligible multi-shard pipelines and executes them internally as a parallelized scatter-gather, dispatching individual commands and aggregating responses before returning them to the client. From the application's perspective, the error never appears.
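The scatter-gather pattern itself is straightforward: group keys by shard, fetch each group concurrently, then reassemble results in the caller's original order. A condensed Go sketch — `shardFor` is a stand-in for Redis's CRC16-based hash-slot mapping, and the shards are plain maps rather than real Redis connections:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shardFor maps a key to a shard; any stable hash illustrates the
// routing step that Redis performs with CRC16 over 16384 slots.
func shardFor(key string, shards int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(shards))
}

// fanoutGet executes a multi-key read as a parallel scatter-gather:
// keys are grouped by shard, each group is fetched concurrently, and
// results come back in the caller's original order, so the client
// never sees a CROSSSLOT-style error.
func fanoutGet(keys []string, shards []map[string]string) []string {
	results := make([]string, len(keys))
	byShard := make(map[int][]int) // shard -> indexes of keys routed there
	for i, k := range keys {
		byShard[shardFor(k, len(shards))] = append(byShard[shardFor(k, len(shards))], i)
	}
	var wg sync.WaitGroup
	for shard, idxs := range byShard {
		wg.Add(1)
		go func(shard int, idxs []int) {
			defer wg.Done()
			for _, i := range idxs {
				results[i] = shards[shard][keys[i]] // each goroutine writes disjoint indexes
			}
		}(shard, idxs)
	}
	wg.Wait()
	return results
}

func main() {
	shards := []map[string]string{{}, {}, {}}
	for _, k := range []string{"a", "b", "c"} {
		shards[shardFor(k, 3)][k] = "val-" + k
	}
	fmt.Println(fanoutGet([]string{"a", "b", "c"}, shards)) // prints [val-a val-b val-c]
}
```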
Implementation and Migration Strategy
First-party client libraries in Go, Ruby, and TypeScript sit alongside the proxy. These are wrappers over existing open source clients already in use in Figma's codebase, which meant the team avoided building a proprietary protocol from scratch. Migrating a service to FigCache was, in the simplest cases, a one-line configuration change to update an endpoint.
The migration strategy was designed to be reversible at every stage. Traffic was shifted service by service, with feature flags allowing instant reversion without code changes or binary deployments. For large workloads such as Figma's main API service, traffic was shifted incrementally across independent domains rather than switched all at once. Before any live rollout, the team ran extensive benchmarks including a weekly distributed stress test on production that surges throughput to an order of magnitude above typical organic peaks.
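Incremental, instantly reversible traffic shifting of this kind is often implemented with deterministic hash-based bucketing behind a feature flag: each rollout unit hashes to a stable bucket, and dialing a percentage up or down moves traffic with no code change or deploy. A hypothetical sketch in Go (the function name and flag semantics are invented for illustration, not Figma's flag system):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// rolloutToFigCache decides, per rollout domain, whether traffic goes
// to the new proxy. Hashing makes the decision deterministic for a
// given domain, and setting percent to 0 reverts everything instantly.
func rolloutToFigCache(domain string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(domain))
	return h.Sum32()%100 < percent
}

func main() {
	for _, d := range []string{"files", "comments", "billing"} {
		fmt.Printf("%s -> figcache=%v\n", d, rolloutToFigCache(d, 50))
	}
}
```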
Benefits and Outcomes
FigCache eliminated the thundering herd connection failures that had contributed to multiple high-severity incidents. Shard failovers, cluster scaling, hardware rotations and OS upgrades are now zero-downtime background operations. The team runs failovers frequently across the entire Redis footprint as a standing exercise of the system's resilience.
Observability across the entire caching stack is now unified, with metrics, logs and traces giving engineers a consistent view of latency, throughput, payload sizes and command cardinality across all workloads. The time to diagnose incidents, Lin writes, has dropped from hours or days to minutes.
Design Choices and Trade-offs
FigCache's Starlark-based configuration system and composable engine tree are the parts of the design most likely to attract interest from engineers facing similar problems. Building a proxy that is transparent to existing clients while being extensible enough to absorb years of future requirements is a hard constraint to satisfy.
The approach has parallels elsewhere in the industry. At lastminute.com, engineers rearchitected a search aggregation system in 2024 to use Redis as an intermediary result store, decoupling supplier search drivers from the aggregation service via RabbitMQ. The goal was similar: reduce coupling, improve scalability, and isolate components from one another's failure modes. Figma's approach goes further by centralizing the Redis access tier itself rather than simply rethinking how data flows into it.
The wider Redis ecosystem has also seen some changes in recent years. In May 2025, Redis returned to open source licensing under AGPLv3 after a year of controversy following its move to the more restrictive SSPLv1 license in March 2024. That shift had prompted the creation of the Valkey fork. Redis 8.0, released alongside the licensing change, includes performance improvements the project describes as up to 87% faster commands and up to 2x higher throughput.
Figma's decision to build an abstraction layer that can swap out the backend storage system looks prudent in that context: Lin notes that FigCache is designed to support alternative backends including AWS MemoryDB and Figma's own Postgres stack behind the same RESP-based interface.
The Build vs. Buy Question
The question of whether to build or buy this kind of infrastructure is one many engineering teams face. Sneha Wasankar, writing on dev.to about Redis caching strategies in production, notes that the choice of cache-aside, write-through, or write-behind patterns often matters less than the reliability of the infrastructure sitting beneath them.
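For reference, cache-aside — the most common of those patterns — reads the cache first and falls back to the source of truth on a miss, populating the cache for subsequent reads. A minimal sketch in Go, with plain maps standing in for Redis and the backing database:

```go
package main

import "fmt"

// cacheAside looks in the cache first and falls back to the database
// on a miss, writing the value back so later reads hit the cache.
func cacheAside(key string, cache, db map[string]string) (string, bool) {
	if v, ok := cache[key]; ok {
		return v, true // cache hit
	}
	v, ok := db[key]
	if ok {
		cache[key] = v // populate on miss
	}
	return v, ok
}

func main() {
	cache := map[string]string{}
	db := map[string]string{"user:1": "alice"}
	v, _ := cacheAside("user:1", cache, db) // miss: fetched from db, cached
	fmt.Println(v, cache["user:1"])         // prints alice alice
}
```

Whichever pattern an application chooses, it issues the same Redis commands underneath, which is why a reliable access tier matters more than the pattern itself.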
Figma's post is largely an argument that, at sufficient scale, the infrastructure itself becomes the product. The company has open-sourced the ResPC library that underpins FigCache, though the proxy itself remains proprietary, suggesting that while the foundational RPC framework over RESP could benefit the broader community, the specific business logic and operational experience embodied in FigCache represent a competitive advantage.
Whether the approach generalizes beyond Figma's specific combination of languages, deployment patterns and operational history is a question the post leaves open. Even so, the detailed account of the design decisions and the problems they solved offers useful guidance for any organization running Redis at scale.
The architecture demonstrates a pattern that could apply to other data services: build a smart proxy layer that absorbs complexity, provides consistent abstractions, and enables zero-downtime operations while remaining transparent to application code.
