Running Ray at Scale on AKS - InfoQ
#Cloud

DevOps Reporter

Microsoft and Anyscale partner to solve GPU scarcity, storage fragmentation, and credential expiry for large-scale ML workloads on Azure Kubernetes Service.

The Azure Kubernetes Service (AKS) team at Microsoft has published detailed guidance for running Anyscale's managed Ray service at scale, addressing three critical operational challenges: GPU capacity limits, scattered ML storage, and credential expiry problems. This guidance builds upon earlier work around open-source KubeRay on AKS, now highlighting Anyscale's enhanced runtime (formerly RayTurbo) with smart autoscaling, improved monitoring, and fault-tolerant training features.

GPU Scarcity and Multi-Region Distribution

GPU scarcity represents one of the most significant operational challenges in large-scale ML. High-demand accelerators like NVIDIA GPUs often face quota and availability constraints in Azure regions, delaying cluster setup and job scheduling. Microsoft's solution employs a multi-cluster, multi-region architecture that distributes Ray clusters across different AKS instances in various Azure regions. This approach enables teams to aggregate GPU quota beyond regional limits, automatically reroute workloads during outages or capacity issues, and extend compute pools to on-premises systems or other cloud providers using Azure Arc with AKS.

The Anyscale console provides a unified view of these registered clusters, while Anyscale Workspaces manages workload scheduling using available capacity through either manual or automatic methods. New regions can be added by creating a cloud_resource.yaml manifest and applying it using the Anyscale CLI, making multi-region expansion configuration-driven and manageable.
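A sketch of what such a manifest might look like. The field names below (provider, region, the node-pool block) are illustrative assumptions, not taken from Anyscale's documentation; consult the Anyscale CLI reference for the actual schema before use:

```yaml
# Hypothetical cloud_resource.yaml -- all field names are illustrative only.
name: aks-westeurope
provider: azure
region: westeurope
kubernetes:
  cluster_name: ray-aks-westeurope   # the AKS cluster being registered
  namespace: anyscale
compute:
  gpu_node_pools:
    - vm_size: Standard_NC24ads_A100_v4
      max_nodes: 16
```

Once applied via the Anyscale CLI, the new region appears alongside the existing registered clusters in the console.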

Solving ML Storage Fragmentation

A common pain point in ML operations involves transferring training data, model checkpoints, and artifacts between pipeline stages—from pre-training to fine-tuning and inference. The guidance addresses this through Azure BlobFuse2, which mounts Azure Blob Storage into Ray worker pods as a POSIX-compatible filesystem. From Ray's perspective, the mount point appears as a local directory, allowing tasks and actors to read datasets and write checkpoints using standard file I/O while BlobFuse2 handles persistence to Azure Blob Storage.
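Because the mount behaves like a local directory, checkpoint code needs no Azure SDK at all. A minimal sketch of what a Ray task's checkpoint I/O could look like; the `/mnt/blob` path and the function names are illustrative, not from the guidance:

```python
import os

# From a Ray task's point of view, a BlobFuse2 mount is just a directory.
# "/mnt/blob" is a hypothetical default -- substitute the mountPath
# configured on your Ray worker pods.
MOUNT_ROOT = os.environ.get("CHECKPOINT_DIR", "/mnt/blob")

def save_checkpoint(root: str, step: int, payload: bytes) -> str:
    """Write a checkpoint with plain file I/O; BlobFuse2 persists it to Blob Storage."""
    path = os.path.join(root, "checkpoints", f"step-{step:06d}.bin")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(payload)
    return path

def load_checkpoint(path: str) -> bytes:
    """Read a checkpoint back; any pod with the same mount sees the same file."""
    with open(path, "rb") as f:
        return f.read()
```

The same code runs unchanged on a laptop against a local directory, which is precisely the portability benefit of the POSIX-compatible mount.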

This architecture ensures data availability across pods and node pools, with local caching preventing GPU stalls during large training runs. Because data is decoupled from compute, Ray clusters can scale up and down without data loss. Setup involves enabling the blob CSI driver when creating the cluster, defining a StorageClass that uses workload identity for authentication, and creating a PersistentVolumeClaim with ReadWriteMany access to enable simultaneous access by multiple Ray workers on different nodes.
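The storage pieces can be sketched roughly as follows, assuming the Azure Blob CSI driver (`blob.csi.azure.com`); the resource names and the workload-identity parameter are placeholders to verify against the current driver documentation:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: blobfuse2-ray              # illustrative name
provisioner: blob.csi.azure.com    # blob CSI driver, enabled at cluster creation
parameters:
  protocol: fuse2                  # mount via BlobFuse2
  skuName: Standard_LRS
  clientID: <workload-identity-client-id>  # placeholder; check driver docs
reclaimPolicy: Retain
volumeBindingMode: Immediate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ray-checkpoints
spec:
  accessModes:
    - ReadWriteMany                # shared by Ray workers across nodes
  storageClassName: blobfuse2-ray
  resources:
    requests:
      storage: 1Ti
```

Mounting the `ray-checkpoints` claim into each Ray worker pod spec gives every worker the same shared view of the blob container.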

Authentication Reliability with Workload Identity

Previously, Anyscale and Azure integration relied on CLI tokens or API keys that expired every 30 days, requiring manual rotation and risking service disruption. The new approach uses Microsoft Entra service principals and AKS workload identity to issue short-lived tokens automatically. The Anyscale Kubernetes Operator pod uses a user-assigned managed identity to request access tokens for the Anyscale service principal from Entra ID, with Azure handling token refresh transparently.
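The wiring follows the standard AKS workload identity pattern. A sketch with placeholder names (the service account name, namespace, and client ID are assumptions; the annotation and label keys are the documented AKS conventions):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: anyscale-operator          # placeholder name
  namespace: anyscale
  annotations:
    # Client ID of the user-assigned managed identity federated with Entra ID
    azure.workload.identity/client-id: <managed-identity-client-id>
```

Operator pods that reference this service account and carry the label `azure.workload.identity/use: "true"` receive a projected Entra ID token, which Azure refreshes automatically.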

This eliminates the need for long-lived credentials stored in the cluster and removes manual rotation requirements. The workload identity model provides fine-grained RBAC for Azure resource access and produces full audit trails through Azure Activity Logs. This is particularly important in multi-cluster environments where manual credential management across many clusters adds significant operational burden.

Industry-Wide Ray Adoption

Microsoft is not alone in partnering with Anyscale. AWS announced its own collaboration at Ray Summit 2024, connecting EKS clusters to the RayTurbo runtime and highlighting hardware flexibility by combining NVIDIA GPUs with AWS's Trainium and Inferentia accelerators. Additionally, SageMaker HyperPod now serves as a deployment target for long-running training jobs requiring node-level resilience.

Google Cloud leads in open-source contributions, with the GKE team working alongside Anyscale engineers to upstream label-based scheduling into Ray v2.49, create a ray.util.tpu layer to reduce resource fragmentation in multi-chip TPU setups, and add Dynamic Resource Allocation for new GB200-backed instances. All three hyperscalers have converged on the same managed Ray operator while layering on their own infrastructure integrations, signaling an industry-wide preference for Kubernetes-plus-Ray as the platform for AI workloads.

The Anyscale on AKS integration is currently in private preview. Teams seeking access should contact their Microsoft account team or file a request on the AKS GitHub repository, including details about their Ray workloads and target regions. The Azure-Samples/aks-anyscale repository on GitHub provides example setups and workloads, including fine-tuning with DeepSpeed and LLaMA-Factory as well as LLM inference endpoints.

The competition has shifted from runtime capabilities to which cloud provider can best streamline the surrounding infrastructure, making this partnership a significant development for organizations scaling ML workloads on Azure.
