How Meta Scaled One UI System Across 10,000 Internal Tools


Backend Reporter

Cindy Zhang's account of XDS, the cross-design system that grew from two engineers into the foundation for over 10,000 internal tools, is really a story about distributed coordination. The hard parts weren't the components. They were managing community contributions at scale, refactoring a monorepo without breaking thousands of surfaces, and surviving a project cancellation.

Most design system talks stop at the components. The interesting engineering happens after the component library exists, when a small team has to keep a single source of truth working for thousands of contributors who all want something slightly different. That is the actual subject of Cindy Zhang's QCon San Francisco presentation on XDS, the unified UI system that now powers more than 10,000 internal tools at Meta, maintained by a team of roughly ten engineers.

The scale numbers frame the problem well. XDS has over a million imports across Meta's codebase. More than 95% of product teams use it to build internal tools. Internal tooling updates account for half the volume of Meta's web codebase. And all of this runs on a monorepo where everything is pinned to the latest version, so there is no graceful per-consumer upgrade path. A change ships everywhere at once.


The problem: fragmentation with no shared substrate

Six years ago, internal tools at Meta were built on components designed for Facebook itself. Those components carried assumptions that don't hold for operational tooling: low information density, social-media layout patterns, and no first-class support for the complex tables, search, and filtering that tools actually need. Worse, the shared component library was the same one powering external products. Any change an internal team wanted carried real risk to consumer-facing surfaces, so internal needs were perpetually deprioritized.

The consequence was duplication. Teams across more than 100 organizations kept rebuilding the same tables and search components because they had no visibility into each other's work. This is a familiar distributed systems failure mode, just expressed in organizational terms. Without a shared substrate and a discovery mechanism, independent actors converge on redundant solutions and the total cost compounds.

Meta's environment also shaped what a fix could look like. There is a strong internal hacker culture where any engineer can spin up a tool with few guardrails, often during a hackathon or internship. Tooling is built in-house rather than purchased. A successful system had to fit that move-fast culture rather than fight it, and it had to be deliberately decoupled from the external component library to remove the risk that had paralyzed internal changes.

The solution: ship something real, then make it visible

XDS started as a grassroots effort: two engineers and four designers working part-time. A prior attempt at an internal design system had already failed, and Zhang is direct about why: it never escaped the design phase, burning its runway trying to craft the perfect process. The XDS team took the opposite approach and built over 100 components in a single half (Meta's six-month planning cycle). The lesson she draws is blunt. Don't get bogged down chasing perfection; just ship it.

The adoption strategy is the part worth studying, because it maps cleanly onto how you bootstrap any system that needs network effects to survive. They solved hard, visible problems first. They migrated PowerSearch, a complex filtering component used across a few hundred tools, and improved its accessibility along the way. Because those tools shared the component, the migration instantly put XDS into production in hundreds of places without each team lifting a finger. They added net-new capabilities tools didn't have before, like dark mode and theming, so adoption brought a carrot rather than just a tax. And they piloted on a real, well-trafficked tool their own team owned, an if-this-then-that workflow builder called Butterfly with thousands of monthly users, to prove the system worked end to end on something with genuine complexity.

The growth curve that followed contains an uncomfortable trade-off. XDS took two years to overtake the previous system, FDS. The old system's usage stayed flat while XDS climbed, but the crossover took far longer than most organizations would tolerate. Zhang's framing is pragmatic: if you only have a six-month mandate, find a way to extend it, because persistence past the crossover point is the whole game.

Scaling the contribution model

Once a system is used everywhere, three scaling problems appear at once, and each one is a consistency problem in disguise.

The first is throughput. Every team needs updates for their own use cases, and a ten-person team cannot be the write bottleneck for the entire company. The answer was a community contribution model where product teams commit changes centrally. Community contributions now make up more than half the commits in XDS, effectively doubling the team's capacity. Each half, the system processes changes from over 150 contributors delivering roughly 45 change sets per week, all of which the core team still reviews.

That review load is the catch. A contribution model is not free; you trade write throughput for review throughput and coordination overhead. Meta manages it by pushing structure to the edges. They use their own automation tool to enforce visibility rules on every change. They built a structured support intake form because unstructured requests generated endless back-and-forth. They wrote explicit API guidance and encoded much of it as custom ESLint rules, so contributors get guardrails directly in the editor instead of having to find documentation. Using lint rules as a contribution-scaling mechanism is a genuinely good pattern: the guidance becomes executable, which means it stays current and doesn't depend on anyone reading it.

The second problem is safe propagation. With a million call sites and a monorepo running one version, any visual or behavioral change can ripple into thousands of surfaces simultaneously. The team's defense is layered. Comprehensive component examples allow manual evaluation in isolation. Screenshot tests, generated for each example through an internal end-to-end framework, catch visual regressions. Accessibility specification tests cover interactive behaviors. But Zhang is honest that tests don't catch everything, and she offers a memorable example. Adding a thousand CSS variable declarations to a div seems harmless, but in one tool it overloaded Chrome's memory and caused browser crashes. Some failures only surface at the intersection of a specific change and a specific consumer, which is exactly the class of bug that distributed systems engineers learn to fear.

For larger changes, they wrap rollouts in Gatekeeper, an internal A/B framework that gradually targets user groups and, critically, acts as a kill switch when something goes sideways. This is feature-flagging applied to UI infrastructure, and the reasoning is identical to flagging a risky backend deploy: make the blast radius adjustable and make rollback instant.
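The combination of gradual targeting and a kill switch can be sketched in a few lines. This is not Gatekeeper's API, which the talk does not describe; `RolloutGate`, `bucketFor`, and `isEnabled` are hypothetical names, and the sketch assumes each user is hashed into a stable bucket from 0 to 99 so that anyone enabled at a low percentage stays enabled as the rollout ramps up.

```typescript
// Hypothetical sketch of a percentage rollout with a kill switch.
// None of these names come from Gatekeeper; they are illustrative only.

interface RolloutGate {
  name: string;
  percentage: number; // share of users (0-100) who get the new behavior
  killed: boolean;    // kill switch: when true, nobody gets it
}

// Deterministic string hash, so a user's bucket never changes between requests.
function bucketFor(gateName: string, userId: string): number {
  const key = `${gateName}:${userId}`;
  let hash = 7;
  for (let i = 0; i < key.length; i++) {
    // ">>> 0" keeps the running value in unsigned 32-bit range.
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

function isEnabled(gate: RolloutGate, userId: string): boolean {
  if (gate.killed) return false; // instant rollback path
  return bucketFor(gate.name, userId) < gate.percentage;
}
```

Because the bucket depends only on the gate name and the user ID, raising `percentage` only ever adds users, and flipping `killed` turns the change off for everyone without a redeploy.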

Building and Scaling UI Systems for Internal Tools at Meta - InfoQ

Codemods as the migration primitive

The third problem is API evolution. The team doesn't always get an API right, and in a monorepo you cannot leave old and new versions coexisting indefinitely. Their primary tool is the codemod, a script that mechanically rewrites code across the entire codebase.

The concrete example is instructive. Around 2021 the accessibility team flagged that internal tools had too many headings. The root cause was an API design problem: XDS text exposed heading types whose smaller variants looked like bold text, so builders used headings to emphasize text. That polluted the page's landmark structure and made navigation harder for assistive technology users. The fix was to separate regular text from headings, forcing builders to be intentional about landmarks.

Rather than deprecating their most-used component and littering the codebase with deprecation flags, the team added the new heading component alongside the old API, wrote a codemod to map every existing usage to the correct new type, and used the same migration to convert small headings back into text. The original accessibility problem and the API change were resolved in one pass.
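The mapping step of such a codemod can be sketched as a pure function. The component names (`XDSText`, `XDSHeading`), the type names, and the `TextUsage` record are all hypothetical, since the talk does not show the real API, and a real codemod would walk the JSX AST rather than a flat record. The sketch assumes the smallest heading variant was the one being used for emphasis.

```typescript
// Hypothetical shapes: the real XDS API is not shown in the talk.
type OldTextType = "heading1" | "heading2" | "heading3" | "heading4" | "body";

interface TextUsage {
  component: "XDSText";
  type: OldTextType;
}

type MigratedUsage =
  | { component: "XDSHeading"; level: 1 | 2 | 3 }
  | { component: "XDSText"; type: "body" | "bodyEmphasized" };

// Map every old usage to exactly one new usage. The switch is exhaustive,
// so adding a new OldTextType without a mapping is a compile error.
function migrateTextUsage(usage: TextUsage): MigratedUsage {
  switch (usage.type) {
    case "heading1":
      return { component: "XDSHeading", level: 1 };
    case "heading2":
      return { component: "XDSHeading", level: 2 };
    case "heading3":
      return { component: "XDSHeading", level: 3 };
    case "heading4":
      // Assumed to be emphasis in disguise: convert back to text.
      return { component: "XDSText", type: "bodyEmphasized" };
    case "body":
      return { component: "XDSText", type: "body" };
  }
}
```

Keeping the mapping total and deterministic is what lets one pass over the codebase resolve both the API change and the accessibility issue.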

For anyone wanting to build these, the toolchain is open. jscodeshift is the standard library for AST-based codemods, AST Explorer lets you inspect the JavaScript AST while writing transforms, and Zhang's favorite approach is ESLint fixers because they double as lint rules. That last point is the elegant part: a fixer migrates existing usages while the lint rule blocks new ones, so you converge on the new API from both directions instead of fighting a moving target.
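A fixable ESLint rule has a small, well-known shape: a `meta` object declaring `fixable: "code"` and a `create` function that returns AST visitors and reports problems with an optional `fix`. The sketch below follows that shape for a hypothetical `<XDSText type="heading4">` usage. The component, prop values, and rule name are assumptions, and the local `Fixer`, `AstNode`, and `RuleContext` interfaces are minimal stand-ins for the types a real plugin would take from the `eslint` package.

```typescript
// Minimal local stand-ins for ESLint's types so the sketch is self-contained.
interface AstNode {
  type: string;
  [key: string]: any;
}
interface Fixer {
  replaceText(node: AstNode, text: string): unknown;
}
interface RuleContext {
  report(descriptor: {
    node: AstNode;
    messageId: string;
    fix(fixer: Fixer): unknown;
  }): void;
}

// Hypothetical rule: a small heading type on <XDSText> becomes emphasized text.
const noSmallHeadingAsEmphasis = {
  meta: {
    type: "suggestion",
    fixable: "code",
    schema: [],
    messages: {
      useEmphasis: "Use emphasized body text instead of a small heading.",
    },
  },
  create(context: RuleContext) {
    return {
      // Visitor for JSX attributes, e.g. type="heading4".
      JSXAttribute(node: AstNode) {
        const onXdsText = node.parent?.name?.name === "XDSText";
        const isTypeProp = node.name?.name === "type";
        const isSmallHeading =
          node.value?.type === "Literal" && node.value.value === "heading4";
        if (!onXdsText || !isTypeProp || !isSmallHeading) return;

        context.report({
          node,
          messageId: "useEmphasis",
          // The same rule that warns in the editor also rewrites the code.
          fix: (fixer) => fixer.replaceText(node.value, '"bodyEmphasized"'),
        });
      },
    };
  },
};
```

Running the rule with autofix over the codebase performs the migration, and leaving it enabled afterwards flags any new use of the old value.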

The talk also addresses AI codemods, which trade determinism for reach. Traditional codemods rely on static AST analysis and are predictable but limited. LLM-based codemods can read context across multiple files and perform more complex migrations with less upfront analysis, but they are non-deterministic and demand careful review. The honest framing of that trade-off is welcome; a non-deterministic transform of a million call sites is not something to run unsupervised.

The best migration, of course, is the one you never have to do. The team's API design heuristics aim for extensibility up front: batch large sets of optional features behind helpers to keep signatures contained, and avoid boolean flags for things that aren't inherently boolean, preferring variant enums that can expand later. Booleans don't compose; the moment you need a third state, you're stuck adding a second boolean and encoding invalid combinations. An enum leaves room to grow.
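The boolean-versus-variant point is easy to see in types. The button API below is hypothetical and only illustrates the heuristic; a string-literal union stands in for the variant enum.

```typescript
// Hypothetical button API, for illustration only.

// Boolean flags: two booleans encode four states, one of which is meaningless.
interface FlagProps {
  isPrimary?: boolean;
  isDestructive?: boolean;
}

// With flags, invalid combinations can only be caught at runtime.
function isInvalidCombination(props: FlagProps): boolean {
  return Boolean(props.isPrimary && props.isDestructive);
}

// Variant union: exactly one state at a time, and new states are additive.
type Variant = "primary" | "secondary" | "destructive";

interface VariantProps {
  variant: Variant;
}

function classNameFor({ variant }: VariantProps): string {
  switch (variant) {
    case "primary":
      return "btn btn-primary";
    case "secondary":
      return "btn btn-secondary";
    case "destructive":
      return "btn btn-destructive";
  }
}
```

Adding a fourth variant means extending the union and adding one `case`; the compiler points at every switch that needs updating, and no existing call site changes meaning.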

The organizational failure mode

The most striking section has nothing to do with code. In April 2023, during Meta's layoffs, the team learned their project was canceled, even as adoption was still climbing. The reason exposes a real risk in grassroots infrastructure. Visibility lived with the builders and their immediate managers, the people directly consuming XDS. Upper management had little awareness that the project existed. A system can be load-bearing for an entire company and still be invisible to the people who decide its funding.

The team's structural response was a council, a virtual group of contributors across the company whose job is to preserve context and maintenance capacity so the system can survive shocks like this. They eventually found a new organizational home and revived the central team, but the lesson generalizes. A distributed system needs both technical redundancy and organizational redundancy. If your platform has a single point of failure in the org chart, it doesn't matter how resilient the code is.

Fighting stagnation by moving up the stack

Having saturated design system adoption, the team faced the mature-system trap: drifting into pure maintenance and chasing the long tail of components. Their move was to reapply the bootstrapping playbook with the leverage they'd accumulated. They asked the community what their biggest pain points were, and the answers weren't about components at all. They were about connecting the UI to the backend and surrounding platform, about routing and preloading being too hard, and about the lack of observability needed to debug performance.

So the team expanded from a component library into a full platform: routing infrastructure, tool management and observability, and backend systems for common patterns. They used their codemod expertise to make every tool a first-class citizen on the platform, which let them wire in central data sources and deliver observability features uniformly. The components evolved into full page patterns wired end to end through the backend and routing systems, so a builder gets a working page rather than a set of parts to assemble.

The AI angle closes the loop. Coding LLMs already build UIs competently, but they need to be taught Meta's specific internal conventions. The team's approach is to generate grounding templates the AI modifies, which dovetails with the pattern and backend work already underway. Zhang notes designers and non-technical people have started generating working tools, which is the payoff of having a constrained, well-specified system for a model to target.

The through line across all of these challenges is that a UI system at this scale is a distributed systems problem wearing a frontend costume. Throughput limits, safe rollouts, kill switches, schema migration, single points of failure: the vocabulary is identical. The components were the easy part. Coordinating thousands of independent contributors against a single shared state, without breaking the ten thousand things depending on it, is the work.
