Google Announces GKE Agent Sandbox and Hypercluster at Next '26, Positioning Kubernetes as AI Agent Runtime
#Cloud

Backend Reporter
8 min read

Google unveiled major Google Kubernetes Engine (GKE) updates at Cloud Next '26 that reframe Kubernetes as a native runtime for AI agents and large-scale model training, including open-source kernel-isolated Agent Sandbox primitives, Hypercluster for managing up to 1 million accelerator chips from a single control plane, and inference optimizations that reduce latency and boost throughput for long-context workloads.

Google used its Cloud Next '26 conference to announce a sweeping set of updates to Google Kubernetes Engine (GKE) that reframe the container orchestration platform as a core operating system for the AI era. The headline features, GKE Agent Sandbox and GKE Hypercluster, address two of the most pressing infrastructure challenges facing organizations building large-scale AI agents and training frontier models: secure, high-throughput execution of untrusted agent code, and management of accelerator fleets that now scale to millions of chips across distributed regions.

The Problem: AI Infrastructure Mismatches

The push to integrate AI agents into production workflows has created a sharp mismatch between existing Kubernetes capabilities and real-world requirements. Multi-agent AI workflows have surged 327% in recent months according to Databricks, while 66% of organizations now rely on Kubernetes to power generative AI applications and agents, per CNCF survey data. Agents frequently need to execute untrusted code, whether that is user-provided tooling, dynamically generated workflows, or third-party plugin code, and standard Kubernetes pod isolation is insufficient to prevent container-escape attacks or cross-tenant data leaks. Existing agent sandboxing solutions each come with limitations: Cloudflare's recently GA Sandboxes use container-based isolation on its edge network, paired with V8 isolate-based Dynamic Workers for lighter workloads, but are proprietary to Cloudflare's platform; E2B isolates workloads in Firecracker microVMs, but runs as a separate managed service outside standard Kubernetes clusters. For organizations already running Kubernetes, adding a proprietary sandbox layer creates operational silos and vendor lock-in.

A second, distinct scaling problem has emerged for AI training workloads. Frontier model training now requires hundreds of thousands of accelerator chips, and organizations have fragmented their infrastructure into hundreds of disconnected Kubernetes clusters to avoid hitting per-cluster scaling limits. This fragmentation creates massive operational overhead: configuration drift across clusters, inconsistent security policies, and no unified way to schedule workloads across the entire accelerator fleet. A single training job that spans multiple clusters requires custom orchestration tooling that most teams do not have the resources to build.

Inference workloads, meanwhile, face their own bottlenecks. Large language models (LLMs) used for agent workflows have strict latency requirements, particularly time-to-first-token for interactive agents, and long context windows create memory pressure as key-value (KV) caches grow to tens of gigabytes per request. Standard Kubernetes scheduling for inference relies on heuristic-based routing that often misallocates capacity, while KV caches typically sit in expensive RAM even when slower, cheaper storage tiers would suffice for longer contexts.
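
To make that memory pressure concrete, the back-of-the-envelope calculation below sizes the KV cache for a hypothetical 70B-class model (80 layers, 8 grouped-query KV heads, head dimension 128, fp16). The configuration is illustrative and not tied to any model Google cited:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Keys and values (factor 2) for every layer, KV head, and token,
    # at the given precision (2 bytes per value for fp16).
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

# Illustrative 70B-class configuration (assumed, not from the announcement).
for tokens in (10_000, 50_000, 128_000):
    gib = kv_cache_bytes(tokens, n_layers=80, n_kv_heads=8, head_dim=128) / 2**30
    print(f"{tokens:>7} tokens -> {gib:5.1f} GiB of KV cache")
```

At 128K tokens, this single request already holds roughly 39 GiB of cache, which is why tiering beyond RAM becomes attractive for long contexts.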

Solution Approach: GKE Updates for AI Workloads

Google's updates target each of these problem areas with a mix of open-source primitives, managed scaling features, and performance optimizations.

GKE Agent Sandbox: Open-Source Kernel Isolation for Agents

Google's first headline update targets the agent code execution problem directly. GKE Agent Sandbox uses gVisor, the same user-space kernel sandboxing technology that secures Google's own Gemini models, to provide kernel-level isolation for untrusted agent code. Unlike proprietary sandbox offerings, Agent Sandbox is an open-source Kubernetes SIG Apps subproject, first launched at KubeCon NA 2025, which means it can run on any conformant Kubernetes cluster, not just GKE.
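
On a plain Kubernetes cluster, the building block underneath this is a RuntimeClass: GKE's existing gVisor integration (GKE Sandbox) registers a gvisor RuntimeClass that routes a pod onto the user-space kernel. The sketch below, using the official Kubernetes Python client, shows that baseline; Agent Sandbox layers its new primitives on top of this kind of isolation. It assumes a cluster with a gVisor-enabled node pool:

```python
from kubernetes import client, config

config.load_kube_config()

# A plain pod scheduled onto gVisor via RuntimeClass -- the same user-space
# kernel isolation Agent Sandbox builds on. Assumes a node pool with
# GKE Sandbox (gVisor) enabled, which registers the "gvisor" RuntimeClass.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="untrusted-agent-task"),
    spec=client.V1PodSpec(
        runtime_class_name="gvisor",
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="tool-runner",
                image="python:3.12-slim",
                command=["python", "-c", "print('running inside gVisor')"],
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```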

The implementation introduces three new Kubernetes primitives to standardize agent sandboxing (a hypothetical usage sketch follows the list):

  • Sandbox: The core workload resource that defines an isolated execution environment for a single agent task.
  • SandboxTemplate: A reusable security blueprint that defines isolation rules, resource limits, and gVisor configuration for sandboxes, similar to how PodTemplates work for standard pods.
  • SandboxClaim: A transactional resource that higher-level agent frameworks such as Google's Agent Development Kit (ADK) or LangChain can use to request execution environments without managing sandbox lifecycle directly.
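
Google has not published the full schema in the announcement, so the sketch below is a hypothetical illustration of how the flow could look with the Python client's dynamic API: the kinds come from the announcement, but the API group/version (agents.x-k8s.io/v1alpha1) and every spec field are assumptions; consult the agent-sandbox subproject for the real definitions.

```python
from kubernetes import config, dynamic
from kubernetes.client import api_client

dyn = dynamic.DynamicClient(
    api_client.ApiClient(configuration=config.load_kube_config())
)

template = {
    "apiVersion": "agents.x-k8s.io/v1alpha1",  # assumed group/version
    "kind": "SandboxTemplate",
    "metadata": {"name": "python-tools"},
    "spec": {  # hypothetical fields: isolation rules and resource limits
        "podTemplate": {
            "spec": {
                "runtimeClassName": "gvisor",
                "containers": [{
                    "name": "runtime",
                    "image": "python:3.12-slim",
                    "resources": {"limits": {"cpu": "500m", "memory": "512Mi"}},
                }],
            }
        }
    },
}

claim = {
    "apiVersion": "agents.x-k8s.io/v1alpha1",  # assumed group/version
    "kind": "SandboxClaim",
    "metadata": {"name": "agent-task-42"},
    "spec": {"templateRef": {"name": "python-tools"}},  # hypothetical field
}

for manifest in (template, claim):
    api = dyn.resources.get(api_version=manifest["apiVersion"],
                            kind=manifest["kind"])
    api.create(body=manifest, namespace="default")
```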

To reduce cold-start latency, Agent Sandbox maintains warm pools of pre-provisioned pods that can be allocated in under one second. Google reports the system can spin up 300 sandboxes per second at sub-second latency, with up to 30% better price-performance when running on Google's Arm-based Axion processors compared to other hyperscale cloud providers.
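
Google has not described the warm-pool mechanics in detail, but the general pattern is easy to sketch: pre-provision sandboxes in the background so an allocation is a dequeue rather than a cold create. Everything below (pool size, provisioning time, names) is invented for illustration:

```python
import queue
import threading
import time

# Invented parameters: a real deployment would pre-provision gVisor pods,
# not in-process objects, and tune these numbers empirically.
COLD_START_SECONDS = 4.0   # hypothetical full pod-create + gVisor boot time
POOL_TARGET = 50

pool: "queue.Queue[str]" = queue.Queue()
_counter = iter(range(10**9))

def provision_sandbox() -> str:
    time.sleep(COLD_START_SECONDS)   # stands in for the real pod create
    return f"sandbox-{next(_counter)}"

def refill_forever() -> None:
    # Background refiller keeps the pool at its target depth.
    while True:
        if pool.qsize() < POOL_TARGET:
            pool.put(provision_sandbox())
        else:
            time.sleep(0.1)

threading.Thread(target=refill_forever, daemon=True).start()

def allocate() -> str:
    # Fast path: hand out a pre-warmed sandbox in well under a second.
    # Slow path: cold create only if demand has drained the pool.
    try:
        return pool.get(timeout=0.5)
    except queue.Empty:
        return provision_sandbox()
```

The real system manages Kubernetes pods rather than in-process objects, but the latency trade is the same: pay the provisioning cost ahead of demand, not on the request path.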

Lovable, a platform that supports more than 200,000 new AI-generated projects daily, is already running production workloads on Agent Sandbox. Co-founder Fabian Hedin said GKE's sandboxing capabilities let the company reliably scale to hundreds of secure sandboxes per second, even during the massive, unpredictable demand spikes that come with that volume.

The agent sandbox market now has three distinct approaches: Cloudflare's edge-native container and V8 isolate sandboxes, E2B's Firecracker microVM sandboxes, and Google's Kubernetes-native gVisor sandboxes. As Alex Gkiouros, a Google Cloud Ambassador and staff architect, observed, GKE Agent Sandbox is currently the only native agent sandbox offering among the three major hyperscalers (AWS, Azure, Google Cloud). The open-source nature is the key differentiator here: organizations that have already standardized on Kubernetes do not need to adopt a separate proprietary platform to get secure agent isolation.

GKE Hypercluster: Single Control Plane for Million-Chip Fleets

The second headline update, GKE Hypercluster, now in private general availability, addresses the cluster fragmentation problem for large-scale training workloads. A single conformant GKE control plane can now manage up to 1 million accelerator chips distributed across 256,000 nodes spanning multiple regions. This eliminates the need to split accelerator fleets into hundreds of small clusters, giving teams a unified API to schedule training jobs across their entire global infrastructure.

Security for Hypercluster relies on Google's Titanium Intelligence Enclave, a hardware-attested "no-admin-access" model. Proprietary model weights, training data, and user prompts are cryptographically sealed from platform administrators, reducing the risk of internal data leaks.

Hypercluster builds on GKE's existing managed control plane, which already handles scaling for large clusters, but extends it to a previously unsupported scale. Google has not disclosed the underlying data store changes required to support 256,000 nodes in a single control plane, but conformance with standard Kubernetes APIs means existing tools such as kubectl, custom controllers, and GitOps pipelines work without modification.
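
That conformance claim is testable with stock tooling. The sketch below uses the official Kubernetes Python client to walk a fleet with paginated LIST calls, which matters at 256,000 nodes, and tallies accelerators via extended resources; the resource name is an assumption, since TPU and GPU node pools expose different names:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

nodes, accelerators, cont = 0, 0, None
while True:
    # Paginate: a single unpaginated LIST over 256,000 nodes would be huge.
    page = v1.list_node(limit=500, _continue=cont)
    for node in page.items:
        nodes += 1
        alloc = node.status.allocatable or {}
        # Resource name is an assumption; GPU fleets expose "nvidia.com/gpu",
        # TPU fleets expose their own extended resource name.
        accelerators += int(alloc.get("nvidia.com/gpu", "0"))
    cont = page.metadata._continue
    if not cont:
        break

print(f"{nodes} nodes, {accelerators} accelerator chips visible")
```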

Inference and Scheduling Optimizations

Google also announced several updates targeting inference performance and workload scheduling, all generally available or in preview:

  • Predictive Latency Boost for GKE Inference Gateway: This feature uses ML-driven routing to replace heuristic-based scheduling for inference requests. It reduces time-to-first-token latency by up to 70% by using real-time capacity data from inference servers to route requests to the optimal node. The feature is built on llm-d, which recently became an official CNCF Sandbox project.
  • Automatic KV Cache Storage Tiering: This addresses long-context memory bottlenecks by automatically tiering KV caches across RAM, Local SSD, and Google Cloud Storage based on access frequency (a toy policy in this spirit is sketched after the list). Google reports a 50% throughput gain for 10K-token prompts offloaded to RAM, and a nearly 70% throughput improvement for 50K-token prompts offloaded to Local SSD.
  • RL Scheduler and RL Sandbox: Purpose-built tools for reinforcement learning workloads. RL Scheduler optimizes scheduling for distributed RL training jobs, while RL Sandbox provides kernel-isolated environments for evaluating reward models, complementing the existing Agent Sandbox for training workflows.
  • Intent-Based Autoscaling: Reduces Horizontal Pod Autoscaler (HPA) reaction times from 25 seconds to 5 seconds by sourcing metrics directly from pods rather than external monitoring stacks. This eliminates the lag between a pod scaling event and the HPA detecting the change, which is critical for bursty agent workloads.
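
Google has not published the tiering heuristics, but a toy policy in the same spirit is easy to sketch: route each request's KV cache to a tier based on how recently it was touched. The cutoffs, signals, and tier names below are invented for illustration:

```python
import time
from dataclasses import dataclass, field

# Hypothetical recency cutoffs -- the real feature's heuristics and signals
# (access frequency, cache size, prompt length) are not public.
HOT_SECONDS, WARM_SECONDS = 30.0, 300.0

@dataclass
class KVCacheEntry:
    request_id: str
    size_bytes: int
    last_access: float = field(default_factory=time.monotonic)

def pick_tier(entry: KVCacheEntry) -> str:
    """Hot caches stay in RAM, cooler ones spill to Local SSD,
    and cold ones go to object storage (Google Cloud Storage)."""
    idle = time.monotonic() - entry.last_access
    if idle < HOT_SECONDS:
        return "ram"
    if idle < WARM_SECONDS:
        return "local_ssd"
    return "gcs"

entry = KVCacheEntry(request_id="req-1", size_bytes=15 * 2**30)
print(pick_tier(entry))  # "ram" immediately after creation
```

The operational caveat from the trade-offs section applies here too: cutoffs that offload too aggressively push latency-sensitive caches onto slow tiers.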

Trade-offs and Industry Context

Google's broader bet with these updates is that Kubernetes itself should be the agent runtime, rather than a separate platform that runs on top of Kubernetes. This is a departure from competitors like Cloudflare, which position their edge network as the primary agent runtime, or E2B, which sells a standalone sandbox service. The open-source Agent Sandbox primitives are a strong differentiator for Google, as they avoid vendor lock-in for organizations that have already invested in Kubernetes. However, gVisor's user-space kernel implementation has historically had syscall compatibility gaps compared to standard Linux kernels, which could limit support for some agent workloads that rely on obscure syscalls. Google claims the implementation is battle-tested from Gemini's production use, but organizations running niche agent tooling should validate compatibility before migrating.

For Hypercluster, the single control plane brings massive operational simplicity, but also a larger blast radius. As Gkiouros noted, a single control plane managing a million chips across regions creates a single point of failure: a bug in the control plane or a misconfigured policy could take down the entire accelerator fleet at once. Google's decision to limit Hypercluster to private GA initially is a pragmatic acknowledgment of this risk, giving select customers time to test change management processes before broader rollout. Multi-region control plane latency is another potential trade-off: nodes in regions far from the control plane may see higher API latency, which could slow down scheduling for time-sensitive training jobs. Google has not disclosed how it mitigates this, but GKE's existing regional control planes use cached API servers in local regions to reduce latency.

The inference updates are more incremental but address clear pain points. Predictive Latency Boost's ML-driven routing is a step up from static heuristic scheduling, but it requires collecting enough telemetry data to train the routing models, which may be a barrier for smaller teams. KV cache tiering is a straightforward win for long-context workloads, but organizations need to tune tiering policies to avoid over-offloading to slow storage tiers that increase per-request latency.

Compared to other hyperscalers, Google is the first to offer a native Kubernetes sandbox for agents, which aligns with its broader push to make GKE the default platform for AI workloads. AWS and Azure have existing sandboxing offerings, but none are integrated as deeply into Kubernetes primitives as Agent Sandbox. AWS's ECS and EKS sandboxing uses Firecracker microVMs via Fargate, but it is not an open-source Kubernetes subproject, and Azure's container sandboxing is proprietary to Azure Kubernetes Service.

About the Author

Steef-Jan Wiggers is a senior cloud editor at InfoQ and a Domain Architect at VGZ in the Netherlands. His technical expertise focuses on integration platforms, Azure DevOps, AI, and Azure Platform Solution Architectures. A 16-time Microsoft Azure MVP, he regularly speaks at conferences and user groups, and writes for InfoQ.
