Building the AI Cathedral: How Google Cloud Scales Inference for Billions of Agents and Users
When NVIDIA CEO Jensen Huang declared AI is having its "iPhone moment," he captured the transformative potential of the technology—but glossed over the monumental engineering challenge of delivering it to billions. As generative AI infiltrates everything from healthcare to education, the real bottleneck isn't model intelligence; it's inference scalability. Today's systems buckle under the weight of cost, energy, and infrastructure demands, especially with the rise of autonomous AI agents. Google Cloud's decade-long response, a meticulously architected stack, is now emerging as the blueprint for affordable, low-latency AI at global scale.
The Foundation: Why Scaling Inference is AI's Unsolved Crisis
Generative AI's breakneck progress masks a looming crisis: inference for large language models (LLMs) is prohibitively expensive and inefficient. With agentic AI poised to multiply request volumes, traditional load balancing and hardware fall short. As Google engineer Federico Iezzi notes, "Today’s systems won’t scale in both requirements and cost." The symptoms are already visible—spiking latency, GPU shortages, and unsustainable energy footprints. Hyperscalers like Google Cloud are addressing this not with a single silver bullet, but through integrated building blocks designed for the unique pressures of real-time AI.
Inside Google's "Cathedral of Compute": Core Components Unpacked
Scaling to billions requires a symphony of specialized technologies. Google's approach hinges on seven pillars:
GKE Inference Gateway: The linchpin for GenAI workloads, this Kubernetes-native gateway replaces rudimentary load balancing with intelligent routing based on LLM-specific metrics like KV cache utilization and LoRA adapter states. By dynamically directing requests to the least congested model replicas, it slashes tail latency by 60% and boosts throughput by 40%. As Iezzi explains, "It creates a level of predictability analogous to a Real-Time Operating System."
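To make the routing idea concrete, here is a minimal Python sketch of picking an endpoint on LLM-specific signals rather than CPU or memory. This is not the gateway's actual algorithm; the Replica fields and replica names are illustrative stand-ins for the KV cache utilization and queue-depth metrics described above.

```python
# Illustrative only: load-balance on LLM-aware signals instead of CPU/memory.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    kv_cache_utilization: float  # 0.0 .. 1.0, fraction of KV cache blocks in use
    queue_depth: int             # requests waiting on this model server

def pick_replica(replicas: list[Replica]) -> Replica:
    """Prefer replicas with free KV cache, then the shortest request queue."""
    return min(replicas, key=lambda r: (r.kv_cache_utilization, r.queue_depth))

replicas = [
    Replica("vllm-0", kv_cache_utilization=0.92, queue_depth=3),
    Replica("vllm-1", kv_cache_utilization=0.41, queue_depth=7),
    Replica("vllm-2", kv_cache_utilization=0.40, queue_depth=2),
]
print(pick_replica(replicas).name)  # -> vllm-2
```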
Custom Metrics for Load Balancers: Traditional balancers route on CPU or memory utilization, but AI inference demands application-aware routing. Google's solution distributes traffic based on queue depth, token latency, and model performance, enabling smarter autoscaling. For instance, requests can be steered away from instances with saturated KV caches—a critical optimization, since KV caching reuses precomputed tensors to accelerate token generation.
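The prerequisite for such routing is that model servers expose those signals in the first place. Below is a hedged sketch, using the standard prometheus_client library, of how a server might publish KV cache utilization and queue depth; the metric names and values are invented for illustration, not the exact series the gateway consumes.

```python
# Sketch of a model server exporting application-aware metrics for a balancer.
import random
import time

from prometheus_client import Gauge, start_http_server

KV_CACHE_UTIL = Gauge("kv_cache_utilization", "Fraction of KV cache blocks in use")
QUEUE_DEPTH = Gauge("request_queue_depth", "Requests waiting to be scheduled")

if __name__ == "__main__":
    start_http_server(9100)  # scrape target for the metrics pipeline
    while True:
        # In a real server these values would come from the inference engine.
        KV_CACHE_UTIL.set(random.random())
        QUEUE_DEPTH.set(random.randint(0, 32))
        time.sleep(5)
```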
Hyperscale Networking: Global reach is non-negotiable. Google leverages anycast routing, advertising a single IP address from 42 regions so users connect to the nearest edge point. Combined with optical circuit switches in TPU pods, this minimizes latency while maximizing resource utilization.
GKE Custom Compute Classes (CCC): Cost control meets flexibility. CCC lets teams define fallback hierarchies for accelerators—prioritizing reserved capacity first, then flexible commitments via Dynamic Workload Scheduler (DWS Flex), and finally Spot VMs. This declarative approach ensures inference workloads land on the best available hardware without budget overruns.
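In GKE the hierarchy is declared in a ComputeClass resource rather than written imperatively, but the fallback behavior it encodes can be sketched in a few lines of Python. Everything here, from the tier names to the fake capacity table and the provision helper, is illustrative only.

```python
# Illustrative sketch of priority-based fallback across capacity tiers.
from typing import Optional

PRIORITY_ORDER = ["reservation", "dws-flex", "spot"]
AVAILABLE_CAPACITY = {"reservation": 0, "dws-flex": 0, "spot": 4}  # fake inventory

def provision(tier: str) -> Optional[str]:
    """Pretend to request an accelerator node from a capacity tier."""
    if AVAILABLE_CAPACITY.get(tier, 0) > 0:
        AVAILABLE_CAPACITY[tier] -= 1
        return f"node-from-{tier}"
    return None

def place_inference_pod() -> str:
    for tier in PRIORITY_ORDER:
        node = provision(tier)
        if node:
            return node
    raise RuntimeError("no accelerator capacity available in any tier")

print(place_inference_pod())  # falls through to "node-from-spot"
```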
Zero-Friction Observability: Google's out-of-the-box dashboards for NVIDIA GPUs and TPUs provide real-time insights into metrics like HBM usage and tensor core utilization. As Iezzi attests, "You get pretty much everything out-of-the-box, it works like a charm." The new TPU Monitoring Library offers chip-level telemetry, crucial for debugging at scale.
Cloud TPUs: The Secret Weapon: Google's custom silicon, like the Trillium (v6e) and Ironwood (v7) TPUs, isn't about raw FLOPS but interconnect efficiency. Each TPU v7 chip boasts 5.4 Tbps of bidirectional bandwidth via Inter-Chip Interconnect (ICI) and optical switches—enabling massive parallelism. As Amin Vahdat, Google's infrastructure lead, stated: "Custom hardware like TPUs delivers 100x better efficiency for AI workloads than general-purpose alternatives."
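A small JAX snippet hints at why interconnect matters: collective operations such as an all-reduce run across every chip in a slice, and their speed is bounded by the ICI links described above. This is generic JAX rather than TPU-specific code, and it runs on a single CPU device too.

```python
import jax
import jax.numpy as jnp

n = jax.device_count()

# psum over the "chips" axis is an all-reduce: every device contributes its
# shard and receives the full sum. On a TPU slice this traffic flows over the
# inter-chip interconnect (ICI); on a laptop it runs on one CPU device.
allreduce = jax.pmap(lambda x: jax.lax.psum(x, axis_name="chips"),
                     axis_name="chips")

shards = jnp.arange(n * 4, dtype=jnp.float32).reshape(n, 4)
print(allreduce(shards))  # identical reduced vector on every device
```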
vLLM & llm-d: The Unifying Layer: Open-source vLLM ties the stack together, supporting GPUs, TPUs, and AMD chips via a unified engine. It enables staggering throughput (e.g., 22,000 tokens/sec) while avoiding vendor lock-in. Building on this, llm-d—a Kubernetes-native project co-developed with Red Hat—disaggregates prefill and decode stages for even greater scale. Iezzi calls it "the reference for billion-user inference," citing its multi-tier KV cache and vLLM-aware scheduler.
Why LoRA and KV Cache Are Game Changers
Two innovations underpin the efficiency gains:
- Low-Rank Adaptation (LoRA): Instead of costly full model fine-tuning, LoRA attaches lightweight adapters to a base LLM—like a "finishing touches kiosk" for specialized tasks. This allows one model to handle myriad custom requests (e.g., multilingual responses) without retraining. A minimal adapter sketch follows this list.
- Key-Value (KV) Caching: By storing intermediate computations during token generation, KV caching cuts redundant work. Google's gateway ensures uniform cache utilization across replicas, preventing bottlenecks that inflate time-to-first-token (TTFT). A bare-bones decoding-loop sketch also follows below.
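As referenced above, here is a minimal LoRA sketch using the Hugging Face PEFT library; the base checkpoint and hyperparameters are arbitrary examples rather than anything Google ships.

```python
# Attach low-rank adapters to a small base model; only the adapters train.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
lora = LoraConfig(
    r=8,                        # low-rank dimension of the adapter matrices
    lora_alpha=16,              # scaling factor applied to the adapter output
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # a small fraction of the base model's weights
```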
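And here is a bare-bones greedy decoding loop that shows what KV caching avoids recomputing: once the cache holds keys and values for earlier tokens, each step only runs the newest token through the model. The model choice is again illustrative.

```python
# Autoregressive decoding with an explicit KV cache (Hugging Face transformers).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("Scaling inference to billions", return_tensors="pt").input_ids
past = None
with torch.no_grad():
    for _ in range(20):
        step_input = ids if past is None else ids[:, -1:]   # only the newest token once cached
        out = model(step_input, past_key_values=past, use_cache=True)
        past = out.past_key_values                          # reused next step, not recomputed
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```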
The Road to a Trillion-Token Future
Google's decade-long investment positions it uniquely for the agentic AI era. Early adopters like Apple already train foundation models on thousands of TPUs, while models like Gemini 2.5 demonstrate 93.4% compute efficiency even during elastic scaling. For developers, the tools are now accessible: GKE Inference Gateway is in production, vLLM supports TPUs on GKE, and llm-d is evolving through open-source collaboration. As Iezzi urges, experimentation is key—whether optimizing Gemma 3 on GKE or stress-testing multi-accelerator deployments. The "cathedral" is complete; the next challenge is populating it with innovations that make AI not just powerful, but universally attainable.
Source: Adapted from "Scaling Inference To Billions of Users And AI Agents" by Federico Iezzi on Google Cloud Medium. Original URL: https://medium.com/google-cloud/scaling-inference-to-billions-of-users-and-agents-516d5d9f5da7